Abstract

Conformal prediction uses past experience to determine precise levels of
confidence in new predictions. Given an error probability $\epsilon$, together
with a method that makes a prediction $\hat{y}$ of a label $y$, it produces a
set of labels, typically containing $\hat{y}$, that also contains $y$ with
probability $1 - \epsilon$. Conformal prediction can be applied to any method
for producing $\hat{y}$: a nearest-neighbor method, a support-vector machine,
ridge regression, etc.

Conformal prediction is designed for an on-line setting in which labels are
predicted successively, each one being revealed before the next is predicted.
The most novel and valuable feature of conformal prediction is that if the
successive examples are sampled independently from the same distribution, then
the successive predictions will be right $1 - \epsilon$ of the time, even
though they are based on an accumulating dataset rather than on independent
datasets.

In addition to the model under which successive examples are sampled
independently, other on-line compression models can also use conformal
prediction. The widely used Gaussian linear model is one of these.

This tutorial presents a self-contained account of the theory of conformal
prediction and works through several numerical examples. A more comprehensive
treatment of the topic is provided in Algorithmic Learning in a Random World,
by Vladimir Vovk, Alex Gammerman, and Glenn Shafer (Springer, 2005).

1 Introduction 



How good is your prediction $\hat{y}$? If you are predicting the label $y$ of a
new object, how confident are you that $y = \hat{y}$? If the label $y$ is a
number, how close do you think it is to $\hat{y}$? In machine learning, these
questions are usually answered in a fairly rough way from past experience. We
expect new predictions to fare about as well as past predictions.

Conformal prediction uses past experience to determine precise levels of
confidence in predictions. Given a method for making a prediction $\hat{y}$,
conformal prediction produces a 95% prediction region: a set $\Gamma^{0.05}$
that contains $y$ with probability at least 95%. Typically $\Gamma^{0.05}$ also
contains the prediction $\hat{y}$. We call $\hat{y}$ the point prediction, and
we call $\Gamma^{0.05}$ the region prediction. In the case of regression, where
$y$ is a number, $\Gamma^{0.05}$ is typically an interval around $\hat{y}$. In
the case of classification, where $y$ has a limited number of possible values,
$\Gamma^{0.05}$ may consist of a few of these values or, in the ideal case,
just one.

Conformal prediction can be used with any method of point prediction for 
classification or regression, including support-vector machines, decision trees, 
boosting, neural networks, and Bayesian prediction. Starting from the method 
for point prediction, we construct a nonconformity measure, which measures 
how unusual an example looks relative to previous examples, and the conformal 
algorithm turns this nonconformity measure into prediction regions. 

Given a nonconformity measure, the conformal algorithm produces a prediction
region $\Gamma^{\epsilon}$ for every probability of error $\epsilon$. The
region $\Gamma^{\epsilon}$ is a $(1 - \epsilon)$-prediction region; it contains
$y$ with probability at least $1 - \epsilon$. The regions for different
$\epsilon$ are nested: when $\epsilon_1 \ge \epsilon_2$, so that
$1 - \epsilon_1$ is a lower level of confidence than $1 - \epsilon_2$, we have
$\Gamma^{\epsilon_1} \subseteq \Gamma^{\epsilon_2}$. If $\Gamma^{\epsilon}$
contains only a single label (the ideal outcome in the case of classification),
we may ask how small $\epsilon$ can be made before we must enlarge
$\Gamma^{\epsilon}$ by adding a second label; the corresponding value of
$1 - \epsilon$ is the confidence we assert in the predicted label.
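
The nesting of the regions can be seen in a small sketch. It assumes the
standard p-value form of the conformal algorithm (a candidate label stays in
the region when the fraction of examples at least as nonconforming as it is
exceeds $\epsilon$), and it uses a hypothetical nonconformity measure, the
distance from the mean of the other examples. The function names and the data
(the 19 numbers used in the numerical example of §2.1.2) are chosen purely for
illustration.

```python
from statistics import mean

def nonconformity(others, z):
    """Hypothetical nonconformity measure: distance of z from the mean of the others."""
    return abs(z - mean(others))

def p_value(old, z):
    """Fraction of examples in the provisional bag at least as nonconforming as z."""
    bag = old + [z]
    scores = [nonconformity(bag[:i] + bag[i + 1:], zi) for i, zi in enumerate(bag)]
    return sum(s >= scores[-1] for s in scores) / len(bag)

def region(old, candidates, eps):
    """Keep every candidate whose p-value exceeds the error probability eps."""
    return [z for z in candidates if p_value(old, z) > eps]

# Illustrative past examples and candidate labels.
old = [17, 20, 10, 17, 12, 15, 19, 22, 17, 19, 14, 22, 18, 17, 13, 12, 18, 15, 17]
candidates = list(range(0, 41))

wide = region(old, candidates, eps=0.05)     # 95% region
narrow = region(old, candidates, eps=0.20)   # 80% region
print("95% region:", wide)
print("80% region:", narrow)
print("nested:", set(narrow) <= set(wide))
```

Because a candidate with p-value above 0.20 also has p-value above 0.05, the
80% region is always contained in the 95% region, whatever nonconformity
measure is plugged in.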

As we explain in §4, the conformal algorithm is designed for an on-line
setting, in which we predict the labels of objects successively, seeing each
label after we have predicted it and before we predict the next one. Our
prediction $\hat{y}_n$ of the $n$th label $y_n$ may use observed features $x_n$
of the $n$th object and the preceding examples
$(x_1, y_1), \ldots, (x_{n-1}, y_{n-1})$. The size of the prediction region
$\Gamma^{\epsilon}$ may also depend on these details. Readers most interested
in implementing the conformal algorithm may wish to turn directly to the
elementary examples in §4.2 and §4.3 and then turn back to the earlier more
general material as needed.

As we explain in §2, the on-line picture leads to a new concept of validity
for prediction with confidence. Classically, a method for finding 95% prediction
regions was considered valid if it had a 95% probability of containing the label
predicted, because by the law of large numbers it would then be correct 95%
of the time when repeatedly applied to independent datasets. But in the on-line
picture, we repeatedly apply a method not to independent datasets but to an
accumulating dataset. After using $(x_1, y_1), \ldots, (x_{n-1}, y_{n-1})$ and
$x_n$ to predict $y_n$, we use
$(x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), (x_n, y_n)$ and $x_{n+1}$ to predict
$y_{n+1}$, and so on. For a 95% on-line method to be valid, 95% of these
predictions must be correct. Under minimal assumptions, conformal prediction is
valid in this new and powerful sense.

One setting where conformal prediction is valid in the new on-line sense
is the one in which the examples $(x_i, y_i)$ are sampled independently from a
constant population, i.e., from a fixed but unknown probability distribution
$Q$. It is also valid under the slightly weaker assumption that the examples
are probabilistically exchangeable (see §3) and under other on-line compression
models, including the widely used Gaussian linear model (see §5). The validity
of conformal prediction under these models is demonstrated in Appendix A.

In addition to the validity of a method for producing 95% prediction regions,
we are also interested in its efficiency. It is efficient if the prediction
region is usually relatively small and therefore informative. In
classification, we would like to see a 95% prediction region so small that it
contains only the single predicted label $\hat{y}_n$. In regression, we would
like to see a very narrow interval around the predicted number $\hat{y}_n$.

The claim of 95% confidence for a 95% conformal prediction region is valid
under exchangeability, no matter what probability distribution $Q$ the
examples follow and no matter what nonconformity measure is used to construct
the conformal prediction region. But the efficiency of conformal prediction will
depend on $Q$ and the nonconformity measure. If we think we know $Q$, we may
choose a nonconformity measure that will be efficient if we are right. If we
have prior probabilities for $Q$, we may use these prior probabilities to
construct a point predictor $\hat{y}_n$ and a nonconformity measure. In the
regression case, we might use as $\hat{y}_n$ the mean of the posterior
distribution for $y_n$ given the first $n - 1$ examples and $x_n$; in the
classification case, we might use the label with the greatest posterior
probability. This strategy of first guaranteeing validity under a relatively
weak assumption and then seeking efficiency under stronger
assumptions conforms to advice long given by John Tukey and others (25l [2a | . 

Conformal prediction is studied in detail in Algorithmic Learning in a Random
World, by Vovk, Gammerman, and Shafer [2g. A recent exposition by
Gammerman and Vovk [l3| emphasizes connections with the theory of random- 
ness, Bayesian methods, and induction. In this article we emphasize the on-line 
concept of validity, the meaning of exchangeability, and the generalization to 
other on-line compression models. We leave aside many important topics that 
are treated in Algorithmic Learning in a Random World, including extensions 
beyond the on-line picture. 



2 Valid prediction regions 

Our concept of validity is consistent with a tradition that can be traced back 
to Jerzy Neyman's introduction of confidence intervals for parameters in 1937 
[igj and even to work by Laplace and others in the late 18th century. But 
the shift of emphasis to prediction (from estimation of parameters) and to the 
on-line setting (where our prediction rule is repeatedly updated) involves some 
rearrangement of the furniture. 

The most important novelty in conformal prediction is that its successive
errors are probabilistically independent. This allows us to interpret "being right
95% of the time" in an unusually direct way. In §2.1 we illustrate this point
with a well-worn example, normally distributed random variables.

In §2.2 we contrast confidence with full-fledged conditional probability. This
contrast has been the topic of endless debate between those who find
confidence methods informative (classical statisticians) and those who insist
that full-fledged probabilities based on all one's information are always
preferable, even if the only available probabilities are very subjective
(Bayesians). Because the debate usually focuses on estimating parameters rather
than predicting future observations, and because some readers may be unaware of
the debate, we take the time to explain that we find the concept of confidence
useful for prediction in spite of its limitations.

2.1 An example of valid on-line prediction 

A 95% prediction region is valid if it contains the truth 95% of the time. To make 
this more precise, we must specify the set of repetitions envisioned. In the on-line 
picture, these are successive predictions based on accumulating information. We 
make one prediction after another, always knowing the outcome of the preceding 
predictions. 

To make clear what validity means and how it can be obtained in this on-line 
picture, we consider prediction under an assumption often made in a first course 
in statistics: 

    Random variables $z_1, z_2, \ldots$ are independently drawn from a normal
    distribution with unknown mean and variance.

Prediction under this assumption was discussed in 1935 by R. A. Fisher, who
explained how to give a 95% prediction interval for $z_n$ based on
$z_1, \ldots, z_{n-1}$ that is valid in our sense. We will state Fisher's
prediction rule, illustrate its application to data, and explain why it is
valid in the on-line setting.

As we will see, the predictions given by Fisher's rule are too weak to be
interesting from a modern machine-learning perspective. This is not surprising,
because we are predicting $z_n$ based on old examples $z_1, \ldots, z_{n-1}$
alone. In general, more precise prediction is possible only in the more
favorable but more complicated set-up where we know some features $x_n$ of the
new example and can use both $x_n$ and the old examples to predict some other
feature $y_n$. But the simplicity of the set-up where we predict $z_n$ from
$z_1, \ldots, z_{n-1}$ alone will help us make the logic of valid prediction
clear.

2.1.1 Fisher's prediction interval 

Suppose we observe the $z_i$ in sequence. After observing $z_1$ and $z_2$, we
start predicting; for $n = 3, 4, \ldots$, we predict $z_n$ after having seen
$z_1, \ldots, z_{n-1}$. The natural point predictor for $z_n$ is the average so
far:

$$\bar{z}_{n-1} := \frac{1}{n-1} \sum_{i=1}^{n-1} z_i,$$

but we want to give an interval that will contain $z_n$ 95% of the time. How can
we do this? Here is Fisher's answer [10]:

1. In addition to calculating the average $\bar{z}_{n-1}$, calculate

   $$s_{n-1}^2 := \frac{1}{n-2} \sum_{i=1}^{n-1} \left( z_i - \bar{z}_{n-1} \right)^2,$$

   which is sometimes called the sample variance. We can usually assume
   that it is non-zero.

2. In a table of percentiles for t-distributions, find $t_{n-2}^{0.025}$, the
   point that the t-distribution with $n - 2$ degrees of freedom exceeds
   exactly 2.5% of the time.

3. Predict that $z_n$ will be in the interval

   $$\bar{z}_{n-1} \pm t_{n-2}^{0.025} \, s_{n-1} \sqrt{\frac{n}{n-1}}. \tag{1}$$

Fisher based this procedure on the fact that

$$\frac{z_n - \bar{z}_{n-1}}{s_{n-1}} \sqrt{\frac{n-1}{n}} \tag{2}$$

has the t-distribution with $n - 2$ degrees of freedom, which is symmetric about
0. This implies that (1) will contain $z_n$ with probability 95% regardless of
the values of the mean $\mu$ and the variance $\sigma^2$.
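
The step from the t-distributed ratio to the 95% interval can be spelled out as
follows. This is a sketch under the section's assumption that the $z_i$ are
independent draws from one normal distribution; the symbols $\mu$, $\sigma^2$,
and $T$ are names introduced here for the unknown mean, the unknown variance,
and Fisher's ratio.

```latex
% Assumes z_1, ..., z_n i.i.d. N(\mu, \sigma^2); \mu, \sigma^2 and T are names introduced here.
\begin{align*}
z_n - \bar{z}_{n-1}
  &\sim N\!\left(0,\ \sigma^2\Bigl(1 + \tfrac{1}{n-1}\Bigr)\right)
   = N\!\left(0,\ \sigma^2\,\tfrac{n}{n-1}\right), \\
\frac{(n-2)\, s_{n-1}^2}{\sigma^2}
  &\sim \chi^2_{n-2}
  \quad \text{independently of } z_n - \bar{z}_{n-1}, \\
T := \frac{z_n - \bar{z}_{n-1}}{s_{n-1}} \sqrt{\frac{n-1}{n}}
  &\sim t_{n-2}, \\
\Pr\!\left( |T| \le t^{0.025}_{n-2} \right)
  &= \Pr\!\left( \left| z_n - \bar{z}_{n-1} \right|
      \le t^{0.025}_{n-2}\, s_{n-1} \sqrt{\tfrac{n}{n-1}} \right) = 0.95 .
\end{align*}
```

Neither $\mu$ nor $\sigma^2$ appears in the distribution of $T$, which is why
the 95% holds whatever their values.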

2.1.2 A numerical example 

We can illustrate (1) using some numbers generated in 1900 by the students
of Emanuel Czuber (1851-1925). These numbers are integers, but they
theoretically have a binomial distribution and are therefore approximately
normally distributed.¹

Here are Czuber's first 19 numbers, $z_1, \ldots, z_{19}$:

17, 20, 10, 17, 12, 15, 19, 22, 17, 19, 14, 22, 18, 17, 13, 12, 18, 15, 17. (3)

From them, we calculate

$$\bar{z}_{19} = 16.53, \qquad s_{19} = 3.32.$$

The upper 2.5% point for the t-distribution with 18 degrees of freedom,
$t_{18}^{0.025}$, is 2.101. So the prediction interval (1) for $z_{20}$ comes
out to [9.55, 23.51].

Taking into account our knowledge that $z_{20}$ will be an integer, we can say
that the 95% prediction is that $z_{20}$ will be an integer between 10 and 23,
inclusive. This prediction is correct; $z_{20}$ is 16.
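
The calculation can be repeated directly from the 19 numbers. The sketch below
is illustrative: `fisher_interval` is a name introduced here, and the
t-quantile 2.101 is passed in as the table value quoted above rather than
computed. Recomputing from the raw data with the factor $\sqrt{n/(n-1)}$
included gives a standard deviation and endpoints that differ slightly from the
rounded figures quoted in the text, but the integer range from 10 to 23 comes
out the same.

```python
import math

# Czuber's first 19 numbers, as listed in (3).
czuber = [17, 20, 10, 17, 12, 15, 19, 22, 17, 19, 14, 22, 18, 17, 13, 12, 18, 15, 17]

def fisher_interval(zs, t_quantile):
    """Interval for the next observation, given the old ones and the t-table value."""
    m = len(zs)                     # m = n - 1 old examples
    n = m + 1
    zbar = sum(zs) / m              # average so far
    s = math.sqrt(sum((z - zbar) ** 2 for z in zs) / (m - 1))   # divisor n - 2
    half_width = t_quantile * s * math.sqrt(n / m)
    return zbar - half_width, zbar + half_width

low, high = fisher_interval(czuber, t_quantile=2.101)   # 18 degrees of freedom
print("interval:", round(low, 2), round(high, 2))
print("integer range:", math.ceil(low), "to", math.floor(high))
```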

¹Czuber's students randomly drew balls from an urn containing six balls, numbered 1
to 6. Each time they drew a ball, they noted its label and put it back in the urn. After 
each 100 draws, they recorded the number of times that the ball labeled with a 1 was drawn 
(0! pp. 329—335). This should have a binomial distribution with parameters 100 and 1/6, 
and it is therefore approximately normal with mean 100/6 = 16.67 and standard deviation
$\sqrt{500/36} = 3.73$.
2.1.3 On-line validity

Fisher did not have the on-line picture in mind. He probably had in mind a
picture where the formula (1) is used repeatedly but in entirely separate
problems. For example, we might conduct many separate experiments that each
consist of drawing 100 random numbers from a normal distribution and then
predicting a 101st draw using (1). Each experiment might involve a different
normal distribution (a different mean and variance), but provided the
experiments are independent from each other, the law of large numbers will
apply. Each time the probability is 95% that $z_{101}$ will be in the interval,
and so this event will happen approximately 95% of the time.

The on-line story may seem more complicated, because the experiment involved
in predicting $z_{101}$ from $z_1, \ldots, z_{100}$ is not entirely independent
of the experiment involved in predicting, say, $z_{105}$ from
$z_1, \ldots, z_{104}$. The 101 random numbers involved in the first experiment
are all also involved in the second. But
as a master of the analytical geometry of the normal distribution 0, Q , Fisher 
would have noticed, had he thought about it, that this overlap does not actually
matter. As we show in Appendix A.3, the events

$$\bar{z}_{n-1} - t_{n-2}^{0.025} \, s_{n-1} \sqrt{\frac{n}{n-1}}
  \;\le\; z_n \;\le\;
  \bar{z}_{n-1} + t_{n-2}^{0.025} \, s_{n-1} \sqrt{\frac{n}{n-1}} \tag{4}$$

for successive $n$ are probabilistically independent in spite of the overlap.
Because of this independence, the law of large numbers again applies. Knowing
each event has probability 95%, we can conclude that approximately 95% of them
will happen. We call the events (4) hits.
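
A simulation gives a feel for this claim. The sketch below draws one long
normal sequence, applies the interval at every step from the third draw on, and
reports the fraction of hits. The helper names, the seed, and the chosen mean
and standard deviation are assumptions made for illustration. In place of a
t-table, the code checks the equivalent condition that the t-distribution
function at the absolute value of Fisher's ratio is at most 0.975, evaluating
that function by Simpson's rule.

```python
import math
import random

def t_cdf(x, df, steps=400):
    """Student-t distribution function at x >= 0, by Simpson's rule over [0, x]."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(t):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3

def online_hit_rate(n_total, mu, sigma, seed):
    """Fraction of on-line 95% intervals that contain the next draw."""
    rng = random.Random(seed)
    count, mean, m2 = 0, 0.0, 0.0     # running count, mean, sum of squared deviations
    hits = trials = 0
    for _ in range(n_total):
        z = rng.gauss(mu, sigma)
        if count >= 2:                # predict z_n from the n - 1 = count old examples
            n = count + 1
            s = math.sqrt(m2 / (count - 1))                  # divisor n - 2
            t = abs(z - mean) / s * math.sqrt((n - 1) / n)   # absolute value of the ratio
            # A hit means t is at most the upper 2.5% point, i.e. t_cdf(t) <= 0.975.
            # No such point exceeds 12.706 (the value for 1 degree of freedom),
            # so anything above 13 is a miss without further computation.
            hit = t <= 13 and t_cdf(t, n - 2) <= 0.975
            hits += hit
            trials += 1
        # Welford update of the running mean and sum of squared deviations.
        count += 1
        delta = z - mean
        mean += delta / count
        m2 += delta * (z - mean)
    return hits / trials

rate = online_hit_rate(n_total=2000, mu=10.0, sigma=3.0, seed=1)
print("fraction of hits:", rate)
```

The reported fraction should be close to 0.95 even though every prediction
reuses all of the earlier draws.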

The prediction interval (1) generalizes to linear regression with normally
distributed errors, and on-line hits remain independent in this general setting. 
Even though formulas for these linear-regression prediction intervals appear in 
textbooks, the independence of their on-line hits was not noted prior to our 
work [2^. Like Fisher, the textbook authors did not have the on-line setting in 
mind. They imagined just one prediction being made in each case where data 
is accumulated. 

We will return to the generalization to linear regression in §5.3.2. There
we will derive the textbook intervals as conformal prediction regions within the 
on-line Gaussian linear model, an on-line compression model that uses slightly 
weaker assumptions than the classical assumption of independent and normally 
distributed errors. 

2.2 Confidence says less than probability. 

Neyman's notion of confidence looks at a procedure before observations are
made. Before any of the $z_i$ are observed, the event (4) involves multiple
uncertainties: $\bar{z}_{n-1}$, $s_{n-1}$, and $z_n$ are all uncertain. The
probability that these three quantities will turn out so that (4) holds is 95%.

We might ask for more than this. It is after we observe the first $n - 1$
examples that we calculate $\bar{z}_{n-1}$ and $s_{n-1}$ and then calculate the
interval (1), and we would like to be able to say at this point that there is
still a 95% probability that $z_n$ will be in (1). But this, it seems, is asking
for too much. The assumptions we have made are insufficient to enable us to find
a numerical probability for (4) that will be valid at this late date. In theory
there is a conditional probability for (4) given $z_1, \ldots, z_{n-1}$, but it
involves the unknown mean and variance of the normal distribution.

Perhaps the matter is best understood from the game-theoretic point of 
view. A probability can be thought of as an offer to bet. A 95% probability, for 
example, is an offer to take either side of a bet at 19 to 1 odds. The probability is 
valid if the offer does not put the person making it at a disadvantage, inasmuch 
as a long sequence of equally reasonable offers will not allow an opponent to 



multiply the capital he or she risks by a large factor [2J]. When we assume 
a probability model (such as the normal model we just used or the on-line 
compression models we will study later), we are assuming that the model's 
probabilities are valid in this sense before any examples are observed. Matters 
may be different afterwards. 

In general, a 95% conformal predictor is a rule for using the preceding
examples $(x_1, y_1), \ldots, (x_{n-1}, y_{n-1})$ and a new object $x_n$ to
give a set, say

$$\Gamma^{0.05}\bigl((x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), x_n\bigr), \tag{5}$$

that we predict will contain $y_n$. If the predictor is valid, the prediction

$$y_n \in \Gamma^{0.05}\bigl((x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), x_n\bigr)$$

will have a 95% probability before any of the examples are observed, and it
will be safe, at that point, to offer 19 to 1 odds on it. But after we observe
$(x_1, y_1), \ldots, (x_{n-1}, y_{n-1})$ and $x_n$ and calculate the set (5),
we may want to withdraw the offer.

Particularly striking instances of this phenomenon can arise in the case of
classification, where there are only finitely many possible labels. We will see
one such instance in §4.3.1, where we consider a classification problem in which
there are only two possible labels, s and v. In this case, there are only four
possibilities for the prediction region:

1. $\Gamma^{0.05}((x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), x_n)$ contains only s.

2. $\Gamma^{0.05}((x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), x_n)$ contains only v.

3. $\Gamma^{0.05}((x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), x_n)$ contains both s and v.

4. $\Gamma^{0.05}((x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), x_n)$ is empty.

The third and fourth cases can occur even though $\Gamma^{0.05}$ is valid. When
the third case happens, the prediction, though uninformative, is certain to be
correct. When the fourth case happens, the prediction is clearly wrong. These
cases are consistent with the prediction being right 95% of the time. But when
we see them arise, we know whether the particular value of $n$ is one of the
95% where

William S. Gossett (1876-1937), Ronald A. Fisher (1890-1962), Jerzy Neyman (1894-1981)

Figure 1: Three influential statisticians. Gossett, who worked as a statis- 
tician for the Guinness brewery in Dublin, introduced the t-distribution to 
English-speaking statisticians in 1908 J^]. Fisher, whose applied and theo- 
retical work invigorated mathematical statistics in the 1920s and 1930s, refined, 
promoted, and extended Gossett's work. Neyman was one of the most influen- 
tial leaders in the subsequent movement to use advanced probability theory to 
give statistics a firmer foundation and further extend its applications. 



we are right or the one of the 5% where we are wrong, and so the 95% will not 
remain valid as a probability defining betting odds. 

In the case of normally distributed examples, Fisher called the 95% probability
for $z_n$ being in the interval (1) a "fiducial probability," and he seems to have
believed that it would not be susceptible to a gambling opponent who knows the 
first n—1 examples (see pp. 119-125 of [l3l)- But this turned out not to be the 
case [i^l- For this and related reasons, most scientists who use Fisher's methods 
have adopted the interpretation offered by Neyman, who wrote about "confi- 
dence" rather than fiducial probability and emphasized that a confidence level 
is a full-fledged probability only before we acquire data. It is the procedure or 
method, not the interval or region it produces when applied to particular data, 
that has a 95% probability of being correct. 

Neyman's concept of confidence has endured in spite of its shortcomings. 
It is widely taught and used in almost every branch of science. Perhaps it is 
especially useful in the on-line setting. It is useful to know that 95% of our 
predictions are correct even if we cannot assert a full-fledged 95% probability 
for each prediction when we make it. 

3 Exchangeability 

Consider variables $z_1, \ldots, z_N$. Suppose that for any collection of $N$
values, the $N!$ different orderings are equally likely. Then we say that
$z_1, \ldots, z_N$ are exchangeable.

Exchangeability is closely related to the idea that examples are drawn
independently from a probability distribution. As we explain in the next
section, §4, it is the basic model for conformal prediction.

In this section we look at the relationship between exchangeability and
independence and then give a backward-looking definition of exchangeability
that can be understood game-theoretically. We conclude with a law of large
numbers for exchangeable sequences, which will provide the basis for our
confidence that our 95% prediction regions are right 95% of the time.



3.1 Exchangeability and independence 

Although the definition of exchangeability we just gave may be clear enough at
an intuitive level, it has two technical problems that make it inadequate as a
formal mathematical definition: (1) in the case of continuous distributions, any
specific values for $z_1, \ldots, z_N$ will have probability zero, and (2) in
the case of discrete distributions, two or more of the $z_i$ might take the same
value, and so a list of possible values $a_1, \ldots, a_N$ might contain fewer
than $N$ distinct values.

One way of avoiding these technicalities is to use the concept of a
permutation, as follows:

    Definition of exchangeability using permutations. The variables
    $z_1, \ldots, z_N$ are exchangeable if for every permutation $\tau$ of the
    integers $1, \ldots, N$, the variables $w_1, \ldots, w_N$, where
    $w_i = z_{\tau(i)}$, have the same joint probability distribution as
    $z_1, \ldots, z_N$.

We can extend this to a definition of exchangeability for an infinite sequence of
variables: $z_1, z_2, \ldots$ are exchangeable if $z_1, \ldots, z_N$ are
exchangeable for every $N$.
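
For a finite discrete distribution, the permutation definition can be checked
mechanically. The sketch below is illustrative: `is_exchangeable` is a name
introduced here, and the joint distribution is represented as a dictionary from
outcome tuples to probabilities, with missing outcomes taken to have
probability zero.

```python
from itertools import permutations

def is_exchangeable(joint, tol=1e-12):
    """Check the permutation definition for a finite discrete joint distribution.

    joint maps each outcome tuple (z_1, ..., z_N) to its probability.
    """
    for outcome, p in joint.items():
        # Every reordering of the outcome must receive the same probability.
        for reordered in permutations(outcome):
            if abs(joint.get(reordered, 0.0) - p) > tol:
                return False
    return True

# Drawing all three tiles, one by one, from a bag holding 1, 1, 2.
without_replacement = {(1, 1, 2): 1 / 3, (1, 2, 1): 1 / 3, (2, 1, 1): 1 / 3}
# A distribution that fixes the order of the same three values.
fixed_order = {(1, 1, 2): 1.0}

print(is_exchangeable(without_replacement))
print(is_exchangeable(fixed_order))
```

The first distribution passes the check although its variables are not
independent; the second fails because reorderings of its only outcome have
probability zero.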

This definition makes it easy to see that independent and identically
distributed random variables are exchangeable. Suppose $z_1, \ldots, z_N$ all
take values from the same example space $\mathbf{Z}$, all have the same
probability distribution $Q$, and are independent. Then their joint
distribution satisfies

$$\Pr\left( z_1 \in A_1 \;\&\; \ldots \;\&\; z_N \in A_N \right) = Q(A_1) \cdots Q(A_N) \tag{6}$$

for any² subsets $A_1, \ldots, A_N$ of $\mathbf{Z}$, where $Q(A)$ is the
probability $Q$ assigns to an example being in $A$. Because permuting the
factors $Q(A_n)$ does not change their product, and because a joint probability
distribution for $z_1, \ldots, z_N$ is determined by the probabilities it
assigns to events of the form $\{ z_1 \in A_1 \;\&\; \ldots \;\&\; z_N \in A_N \}$,
this makes it clear that $z_1, \ldots, z_N$ are exchangeable.

Exchangeability implies that variables have the same distribution. On the
other hand, exchangeable variables need not be independent. Indeed, when we
average two or more distinct joint probability distributions under which
variables are independent, we usually get a joint probability distribution
under which they are exchangeable (averaging preserves exchangeability) but not
independent (averaging usually does not preserve independence). According to a
famous theorem by de Finetti, a joint distribution for an infinite sequence of
distinct variables is exchangeable only if it is a mixture of joint
distributions under which the variables are independent 15[. As Table 1 shows,
the picture is more complicated in the finite case.

²We leave aside technicalities involving measurability.
                             Left     Middle   Right
$\Pr(z_1 = H \& z_2 = H)$    0.81     0.41     0.10
$\Pr(z_1 = H \& z_2 = T)$    0.09     0.09     0.40
$\Pr(z_1 = T \& z_2 = H)$    0.09     0.09     0.40
$\Pr(z_1 = T \& z_2 = T)$    0.01     0.41     0.10

Table 1: Examples of exchangeability. We consider variables $z_1$ and $z_2$,
each of which comes out H or T. Exchangeability requires only that
$\Pr(z_1 = H \& z_2 = T) = \Pr(z_1 = T \& z_2 = H)$. Three examples of
distributions for $z_1$ and $z_2$ with this property are shown. On the left,
$z_1$ and $z_2$ are independent and identically distributed; both come out H
with probability 0.9. The middle example is obtained by averaging this
distribution with the distribution in which the two variables are again
independent and identically distributed but T's probability is 0.9. The
distribution on the right, in contrast, cannot be obtained by averaging
distributions under which the variables are independent and identically
distributed. Examples of this last type disappear as we ask for a larger and
larger number of variables to be exchangeable.
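
The three columns of Table 1 can be reproduced and probed numerically. The
sketch below rebuilds the left and middle columns from their descriptions and
then applies a simple test to all three. The test is an addition made here, not
taken from the text: for any mixture of independent and identically distributed
pairs, the Cauchy-Schwarz inequality gives
$\Pr(HH) \Pr(TT) \ge \Pr(HT)^2$, so a distribution that violates this
inequality cannot be such a mixture. The inequality is only a necessary
condition, and all function names are illustrative.

```python
def iid_pair(p_heads):
    """Joint distribution of two independent tosses with Pr(H) = p_heads."""
    pr = {"H": p_heads, "T": 1 - p_heads}
    return {(a, b): pr[a] * pr[b] for a in "HT" for b in "HT"}

def average(d1, d2):
    """Equal-weight mixture of two joint distributions on the same outcomes."""
    return {k: (d1[k] + d2[k]) / 2 for k in d1}

def could_be_iid_mixture(d, tol=1e-12):
    """Necessary condition only: Pr(HH) * Pr(TT) >= Pr(HT)^2."""
    return d[("H", "H")] * d[("T", "T")] >= d[("H", "T")] ** 2 - tol

left = iid_pair(0.9)
middle = average(iid_pair(0.9), iid_pair(0.1))
right = {("H", "H"): 0.10, ("H", "T"): 0.40, ("T", "H"): 0.40, ("T", "T"): 0.10}

for name, d in [("left", left), ("middle", middle), ("right", right)]:
    symmetric = abs(d[("H", "T")] - d[("T", "H")]) < 1e-12   # exchangeability for two variables
    print(name, {k: round(v, 2) for k, v in d.items()}, symmetric, could_be_iid_mixture(d))
```

All three distributions are exchangeable, but only the right-hand one fails the
necessary condition, in line with the caption.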

3.2 Backward-looking definitions of exchangeability 

Another way of defining exchangeability looks backwards from a situation where
we know the unordered values of $z_1, \ldots, z_N$.

Suppose Joe has observed $z_1, \ldots, z_N$. He writes each value on a tile
resembling those used in Scrabble®, puts the $N$ tiles in a bag, shakes the bag,
and gives it to Bill to inspect. Bill sees the $N$ values (some possibly equal
to each other) without knowing their original order. Bill also knows the joint
probability distribution for $z_1, \ldots, z_N$. So he obtains probabilities for
the ordering of the tiles by conditioning this joint distribution on his
knowledge of the bag. The joint distribution is exchangeable if and only if
these conditional probabilities are the same as the probabilities for the result
of ordering the tiles by successively drawing them at random from the bag
without replacement.

To make this into a definition of exchangeability, we formalize the notion of
a bag. A bag (or multiset, as it is sometimes called) is a collection of elements
in which repetition is allowed. It is like a set inasmuch as its elements are
unordered but like a list inasmuch as an element can occur more than once. We
write $\lbag a_1, \ldots, a_N \rbag$ for the bag obtained from the list
$a_1, \ldots, a_N$ by removing information about the ordering.

Here are three equivalent conditions on the joint distribution of a sequence
of random variables $z_1, \ldots, z_N$, any of which can be taken as the
definition of exchangeability.
Figure 2: Ordering the tiles. Joe gives Bill a bag containing five tiles, and Bill
arranges them to form the list 43477. Bill can calculate conditional probabilities
for which $z_i$ had which of the five values. His conditional probability for
$z_5 = 4$, for example, is 2/5. There are $5!/(2!\,2!) = 30$ ways of assigning
the five values to the five variables;
$(z_1, z_2, z_3, z_4, z_5) = (4, 3, 4, 7, 7)$ is one of these, and they
all have the same probability, 1/30.

1. For any bag $B$ of size $N$, and for any examples $a_1, \ldots, a_N$,

   $$\Pr\left( z_1 = a_1 \;\&\; \ldots \;\&\; z_N = a_N \mid \lbag z_1, \ldots, z_N \rbag = B \right)$$

   is equal to the probability that successive random drawings from the bag
   $B$ without replacement produce first $a_N$, then $a_{N-1}$, and so on, until
   the last element remaining in the bag is $a_1$.

2. For any $n$, $1 \le n \le N$, $z_n$ is independent of $z_{n+1}, \ldots, z_N$
   given the bag $\lbag z_1, \ldots, z_n \rbag$, and for any bag $B$ of size $n$,

   $$\Pr\left( z_n = a \mid \lbag z_1, \ldots, z_n \rbag = B \right) = \frac{k}{n}, \tag{7}$$

   where $k$ is the number of times $a$ occurs in $B$.

3. For any bag $B$ of size $N$, and for any examples $a_1, \ldots, a_N$,

   $$\Pr\left( z_1 = a_1 \;\&\; \ldots \;\&\; z_N = a_N \mid \lbag z_1, \ldots, z_N \rbag = B \right)
     = \begin{cases}
         \dfrac{n_1! \cdots n_k!}{N!} & \text{if } B = \lbag a_1, \ldots, a_N \rbag, \\
         0 & \text{if } B \ne \lbag a_1, \ldots, a_N \rbag,
       \end{cases} \tag{8}$$

   where $k$ is the number of distinct values among the $a_i$, and
   $n_1, \ldots, n_k$ are the respective numbers of times they occur. (If the
   $a_i$ are all distinct, the expression $n_1! \cdots n_k!/N!$ reduces to
   $1/N!$.)

We leave it to the reader to verify that these three conditions are equivalent
to each other. The second condition, which we will emphasize, is represented
pictorially in Figure 3.
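
The conditions can be checked by brute force on the five-tile bag of Figure 2.
The sketch below is illustrative; the function names are introduced here. It
assigns every distinct ordering of the bag the probability
$n_1! \cdots n_k!/N!$ from the third condition and then confirms that the
probability of the last variable taking a given value is $k/n$, as the second
condition requires.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations
from math import factorial

def orderings(bag):
    """Distinct orderings of a bag, each with probability n_1! ... n_k! / N!."""
    prob = Fraction(1)
    for multiplicity in Counter(bag).values():
        prob *= factorial(multiplicity)
    prob /= factorial(len(bag))
    return {seq: prob for seq in set(permutations(bag))}

def last_value_prob(dist, a):
    """Probability, given the bag, that the last variable equals a."""
    return sum(p for seq, p in dist.items() if seq[-1] == a)

dist = orderings([4, 3, 4, 7, 7])
print(len(dist), "orderings, each with probability", next(iter(dist.values())))
print("total probability:", sum(dist.values()))
for a in (3, 4, 7):
    print("Pr(z_5 =", a, "| bag) =", last_value_prob(dist, a))
```

Exact fractions are used so that the comparison with $k/n$ involves no rounding.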

The backward-looking conditions are also equivalent to the definition of
exchangeability using permutations given on p. 9. This equivalence is
elementary in the case where every possible sequence of values
$a_1, \ldots, a_N$ has positive probability. But complications arise when this
probability is zero, because the conditional probability on the left-hand side
of (8) is then defined only with probability one by the joint distribution. We
do not explore these complications here.
[Figure 3 diagram: the chain of bags $\lbag z_1 \rbag$, $\lbag z_1, z_2 \rbag$,
..., $\lbag z_1, \ldots, z_{N-1} \rbag$, $\lbag z_1, \ldots, z_N \rbag$, with
arrows pointing backwards from each bag to the example $z_n$ drawn from it and
to the next smaller bag.]

Figure 3: Backward probabilities, step by step. The two arrows backwards
from each bag $\lbag z_1, \ldots, z_n \rbag$ symbolize drawing an example $z_n$
out at random, leaving the smaller bag $\lbag z_1, \ldots, z_{n-1} \rbag$. The
probabilities for the result of the drawing are given by (7). Readers familiar
with Bayes nets [4] will recognize this diagram as an example; conditional on
each variable, a joint probability distribution is given for its children (the
variables to which arrows from it point), and given the variable, its
descendants are independent of its ancestors.

3.3 The betting interpretation of exchangeability 

The framework for probability developed in [2J] formalizes classical results of 
probability theory, such as the law of large numbers, as theorems of game theory: 
a bettor can multiply the capital he risks by a large factor if these results do not 
hold. This allows us to express the empirical interpretation of given probabilities 
in terms of betting, using what we call Cournot's principle: the odds determined
by the probabilities will not allow a bettor to multiply the capital he or she risks
by a large factor.

By applying this idea to the sequence of probabilities (7), we obtain a betting
interpretation of exchangeability. Think of Joe and Bill as two players in a game
that moves backwards from point $N$ in Figure 3. At each step, Joe provides
new information and Bill bets. Designate by $\mathcal{K}_N$ the total capital
Bill risks. He begins with this capital at $N$, and at each step $n$ he bets on
what $z_n$ will turn out to be. When he bets at step $n$, he cannot risk losing
more than he has at that point (because he is not risking more than
$\mathcal{K}_N$ in the whole game), but otherwise he can bet as much as he
wants for or against each possible value $a$ for $z_n$ at the odds
$(k/n) : (1 - k/n)$, where $k$ is the number of elements in the current bag
equal to $a$.

For brevity, we write $B_n$ for the bag $\lbag z_1, \ldots, z_n \rbag$, and for
simplicity, we set the initial capital $\mathcal{K}_N$ equal to $1. This gives
the following protocol:

The Backward-Looking Betting Protocol
Players: Joe, Bill
$\mathcal{K}_N := 1$.
Joe announces a bag $B_N$ of size $N$.
FOR $n = N, N-1, \ldots, 2, 1$:
    Bill bets on $z_n$ at odds set by (7).
    Joe announces $z_n \in B_n$.
    $\mathcal{K}_{n-1} := \mathcal{K}_n$ + Bill's net gain.
    $B_{n-1} := B_n \setminus \lbag z_n \rbag$.
Constraint: Bill must move so that his capital $\mathcal{K}_n$ will be
nonnegative for all $n$ no matter how Joe moves.

Our betting interpretation of exchangeability is that Bill will not multiply his
initial capital $\mathcal{K}_N$ by a large factor in this game.
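
The protocol can be played out on a small bag. In the sketch below, Bill
follows a hypothetical strategy chosen only for illustration: at each step he
stakes half of his current capital on the commonest value in the current bag,
at the odds $(k/n) : (1 - k/n)$. Averaging his final capital over all equally
likely orders in which the tiles can be drawn shows that the odds are fair, so
that on average the strategy neither gains nor loses. The function name and the
strategy are assumptions made here.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

def play(draw_order, fraction=Fraction(1, 2)):
    """Run the protocol once; draw_order lists z_N, z_{N-1}, ..., z_1 as Joe announces them."""
    capital = Fraction(1)                  # initial capital set to 1
    bag = Counter(draw_order)              # the bag Joe announces at the start
    n = len(draw_order)
    for z in draw_order:
        # Hypothetical strategy: back the commonest value a, which occurs k times.
        a, k = max(bag.items(), key=lambda item: (item[1], item[0]))
        stake = capital * fraction         # never more than the current capital
        if z == a:
            capital += stake * Fraction(n - k, k)   # payoff at odds (k/n) : (1 - k/n)
        else:
            capital -= stake
        # Remove the announced tile to obtain the next, smaller bag.
        bag[z] -= 1
        if bag[z] == 0:
            del bag[z]
        n -= 1
    return capital

# Every order of drawing the five tiles is equally likely under exchangeability.
outcomes = [play(order) for order in permutations([4, 3, 4, 7, 7])]
print("average final capital:", sum(outcomes) / len(outcomes))
print("largest final capital:", max(outcomes))
print("smallest final capital:", min(outcomes))
```

Individual runs can end above or below the starting capital, but the exact
average over all drawing orders equals the initial capital of 1.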

The permutation definition of exchangeability does not lead to an equally
simple betting interpretation, because the probabilities for
$z_1, \ldots, z_N$ to which the permutation definition refers are not
determined by the mere assumption of exchangeability.

3.4 A law of large numbers for exchangeable sequences 

As we noted when we studied Fisher's prediction interval in §2.1.3, the validity
of on-line prediction requires more than having a high probability of a hit for
each individual prediction. We also need a law of large numbers, so that we
can conclude that a high proportion of the high-probability predictions will be
correct. As we show in §A.3, the successive hits in the case of Fisher's region
predictor are independent, so that the usual law of large numbers applies. What
can we say in the case of conformal prediction under exchangeability?

Suppose $z_1, \dots, z_N$ are exchangeable, drawn from an example space $\mathbf{Z}$. In
this context, we adopt the following definitions.

• An event $E$ is an $n$-event, where $1 \le n \le N$, if its happening or failing is
determined by the value of $z_n$ and the value of the bag $\lbag z_1, \dots, z_{n-1} \rbag$.

• An $n$-event $E$ is $\epsilon$-rare if

$$\Pr(E \mid \lbag z_1, \dots, z_n \rbag) \le \epsilon. \tag{9}$$

The left-hand side of the inequality (9) is a random variable, because the bag
$\lbag z_1, \dots, z_n \rbag$ is random. The inequality says that this random variable never
exceeds $\epsilon$.

As we will see in the next section, the successive errors for a conformal
predictor are $\epsilon$-rare $n$-events. So the validity of conformal prediction follows
from the following informal proposition.

Informal Proposition 1. Suppose $N$ is large, and the variables $z_1, \dots, z_N$ are
exchangeable. Suppose $E_n$ is an $\epsilon$-rare $n$-event for $n = 1, \dots, N$. Then the law
of large numbers applies; with very high probability, no more than approximately
the fraction $\epsilon$ of the events $E_1, \dots, E_N$ will happen.

In Appendix A we formalize this informal proposition in two ways: classically
and game-theoretically.

The classical approach appeals to the classical weak law of large numbers,
which tells us that if $E_1, \dots, E_N$ are mutually independent and each has
probability exactly $\epsilon$, and $N$ is sufficiently large, then there is a very high probability
that the fraction of the events that happen will be close to $\epsilon$. We show in §A.1
that if (9) holds with equality, then the $E_n$ are mutually independent and each of
them has unconditional probability $\epsilon$. Having the inequality instead of equality
means that the $E_n$ are even less likely to happen, and this will not reverse the
conclusion that few of them will happen.

The game-theoretic approach is more straightforward, because the game-theoretic
version of the law of large numbers does not require independence or exact
levels of probability. In the game-theoretic framework, the only question is
whether the probabilities specified for successive events are rates at which a
bettor can place successive bets. The Backward-Looking Betting Protocol says
that this is the case for $\epsilon$-rare $n$-events. As Bill moves through the protocol from
$N$ to 1, he is allowed to bet against each error $E_n$ at a rate corresponding to its
having probability $\epsilon$ or less. So the game-theoretic weak law of large numbers
(H, pp. 124-126) applies directly. Because the game-theoretic framework is
not well known, we state and prove this law of large numbers, specialized to the
Backward-Looking Betting Protocol, in §A.2.

4 Conformal prediction under exchangeability 

We are now in a position to state the conformal algorithm under exchangeability 
and explain why it produces valid nested prediction regions. 

We distinguish two cases of on-line prediction. In both cases, we observe
examples $z_1, \dots, z_N$ one after the other and repeatedly predict what we will
observe next. But in the second case we have more to go on when we make each
prediction.

1. Prediction from old examples alone. Just before observing $z_n$, we predict
it based on the previous examples, $z_1, \dots, z_{n-1}$.

2. Prediction using features of the new object. Each example $z_i$ consists of an
object $x_i$ and a label $y_i$. In symbols: $z_i = (x_i, y_i)$. We observe in sequence
$x_1, y_1, \dots, x_N, y_N$. Just before observing $y_n$, we predict it based on what
we have observed so far: $x_n$ and the previous examples $z_1, \dots, z_{n-1}$.

Prediction from old examples may seem relatively uninteresting. It can be
considered a special case of prediction using features $x_n$ of new examples (the
case in which the $x_n$ provide no information), and in this special case we may
have too little information to make useful predictions. But its simplicity makes
prediction with old examples alone advantageous as a setting for explaining
the conformal algorithm, and as we will see, it is then straightforward to take
account of the new information $x_n$.

Conformal prediction requires that we first choose a nonconformity measure,
which measures how different a new example is from old examples. In §4.1, we
explain how nonconformity measures can be obtained from methods of point
prediction. In §4.2, we state and illustrate the conformal algorithm for predicting
new examples from old examples alone. In §4.3, we generalize to prediction
with the help of features of a new example. In §4.4, we explain why conformal
prediction produces the best possible valid nested prediction regions under
exchangeability. Finally, in §4.5 we discuss the implications of the failure of the
assumption of exchangeability.

For some readers, the simplicity of the conformal algorithm may be obscured
by its generality and the scope of our preliminary discussion of nonconformity
measures. We encourage such readers to look first at §4.2.1, §4.3.1, and §4.3.2,
which provide largely self-contained accounts of the algorithm as it applies to
some small datasets.

4.1 Nonconformity measures 

The starting point for conformal prediction is what we call a nonconformity
measure, a real-valued function $A(B, z)$ that measures how different an example
$z$ is from the examples in a bag $B$. The conformal algorithm assumes that
a nonconformity measure has been chosen. The algorithm will produce valid
nested prediction regions using any real-valued function $A(B, z)$ as the
nonconformity measure. But the prediction regions will be efficient (small) only if
$A(B, z)$ measures well how different $z$ is from the examples in $B$.

A method $\hat{z}(B)$ for obtaining a point prediction $\hat{z}$ for a new example from a
bag $B$ of old examples usually leads naturally to a nonconformity measure $A$.
In many cases, we only need to add a way of measuring the distance $d(z, z')$
between two examples. Then we define $A$ by

$$A(B, z) := d(\hat{z}(B), z). \tag{10}$$

The prediction regions produced by the conformal algorithm do not change when
the nonconformity measure $A$ is transformed monotonically. If $A$ is nonnegative,
for example, replacing $A$ with $A^2$ will make no difference. Consequently, the
choice of the distance measure $d(z, z')$ is relatively unimportant. The important
step in determining the nonconformity measure $A$ is choosing the point predictor
$\hat{z}(B)$.

To be more concrete, suppose the examples are real numbers, and write
$\bar{z}_B$ for the average of the numbers in $B$. If we take this average as our point
predictor $\hat{z}(B)$, and we measure the distance between two real numbers by the
absolute value of their difference, then (10) becomes

$$A(B, z) := |\bar{z}_B - z|. \tag{11}$$

If we use the median of the numbers in $B$ instead of their average as $\hat{z}(B)$, we
get a different nonconformity measure, which will produce different prediction
regions when we use the conformal algorithm. On the other hand, as we have
already said, it will make no difference if we replace the absolute difference
$d(z, z') = |z - z'|$ with the squared difference $d(z, z') = (z - z')^2$, thus squaring
$A$.
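
The invariance under monotone transformations can be checked numerically. The sketch below anticipates the p-value that §4.2 defines (the fraction of leave-one-out nonconformity scores at least as large as the new example's score) and compares the absolute and squared distances on a small hypothetical dataset; the function name `p_value` and the data are assumptions made for illustration.

```python
from fractions import Fraction

def p_value(old, z, distance):
    """Conformal p-value of candidate z, scoring each z_i by d(average of the others, z_i)."""
    examples = list(old) + [z]
    n = len(examples)
    scores = []
    for i, zi in enumerate(examples):
        others = examples[:i] + examples[i + 1:]
        center = Fraction(sum(others), len(others))   # point predictor: the average
        scores.append(distance(center, zi))
    # Fraction of scores at least as large as the score of the new example.
    return Fraction(sum(s >= scores[-1] for s in scores), n)

def absolute(u, v):
    return abs(u - v)

def squared(u, v):
    return (u - v) ** 2

old = [3, 8, 5, 6, 4]   # hypothetical integer data
same = all(p_value(old, z, absolute) == p_value(old, z, squared)
           for z in range(0, 12))
```

Because the p-value depends on the scores only through their ordering, `same` comes out true: squaring nonnegative scores changes their values but not their ranks. Exact rational arithmetic is used so that no tie is created or broken by rounding.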

We can also vary (11) by including the new example in the average:

$$A(B, z) := |(\text{average of } z \text{ and all the examples in } B) - z|. \tag{12}$$

This results in the same prediction regions as (11), because if $B$ has $n$ elements,
then

$$|(\text{average of } z \text{ and all the examples in } B) - z| = \left| \frac{n \bar{z}_B + z}{n+1} - z \right| = \frac{n}{n+1} |\bar{z}_B - z|,$$

and as we have said, conformal prediction regions are not changed by a monotonic
transformation of the nonconformity measure. In the numerical example
that we give in §4.2.1 below, we use (12) as our nonconformity measure.

When we turn to the case where features of a new object help us predict
a new label, we will consider, among others, the following two nonconformity
measures:



Distance to the nearest neighbors for classification. Suppose $B =
\lbag z_1, \dots, z_{n-1} \rbag$, where each $z_i$ consists of a number $x_i$ and a nonnumerical label
$y_i$. Again we observe $x$ but not $y$ for a new example $z = (x, y)$. The nearest-neighbor
method finds the $x_i$ closest to $x$ and uses its label $y_i$ as our prediction
of $y$. If there are only two labels, or if there is no natural way to measure the
distance between labels, we cannot measure how wrong the prediction is; it is
simply right or wrong. But it is natural to measure the nonconformity of the
new example $(x, y)$ to the old examples $(x_i, y_i)$ by comparing $x$'s distance to old
objects with the same label to its distance to old objects with a different label.
For example, we can set

$$A(B, z) := \frac{\min\{|x_i - x| : 1 \le i \le n-1,\ y_i = y\}}{\min\{|x_i - x| : 1 \le i \le n-1,\ y_i \ne y\}} = \frac{\text{distance to } z\text{'s nearest neighbor in } B \text{ with the same label}}{\text{distance to } z\text{'s nearest neighbor in } B \text{ with a different label}}. \tag{13}$$
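
As a hedged illustration, the ratio (13) can be written as a short function. The name `nn_nonconformity` is hypothetical. The handling of degenerate cases is partly an assumption: the zero-over-zero convention (score zero) is the one adopted in §4.3.1, while returning infinity when the denominator alone is zero, or when the bag contains no example with the same label, is a choice made here.

```python
import math

def nn_nonconformity(bag, z):
    """Nearest-neighbor nonconformity (13) of z = (x, y) to a bag of (x_i, y_i) pairs."""
    x, y = z
    same = min((abs(xi - x) for xi, yi in bag if yi == y), default=math.inf)
    diff = min((abs(xi - x) for xi, yi in bag if yi != y), default=math.inf)
    if same == 0 and diff == 0:
        return 0.0                 # convention from the text: 0/0 is taken to be 0
    if diff == 0 or same == math.inf:
        return math.inf            # assumption: treat these cases as maximally strange
    return same / diff

# Hypothetical bag: two examples labeled 's' and one labeled 'v'.
bag = [(1, "s"), (2, "s"), (6, "v")]
score_v = nn_nonconformity(bag, (5, "v"))   # nearest 'v' at distance 1, nearest 's' at 3
score_s = nn_nonconformity(bag, (5, "s"))   # nearest 's' at distance 3, nearest 'v' at 1
```

With this bag the new object at 5 scores 1/3 under the label `v` and 3 under the label `s`, so it conforms better to the `v` examples.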



Distance to a regression line. Suppose $B = \lbag (x_1, y_1), \dots, (x_l, y_l) \rbag$, where
the $x_i$ and $y_i$ are numbers. The most common way of fitting a line to such pairs
of numbers is to calculate the averages

$$\bar{x}_l := \frac{1}{l} \sum_{j=1}^{l} x_j \quad \text{and} \quad \bar{y}_l := \frac{1}{l} \sum_{j=1}^{l} y_j,$$

and then the coefficients

$$b_l = \frac{\sum_{j=1}^{l} (x_j - \bar{x}_l)\, y_j}{\sum_{j=1}^{l} (x_j - \bar{x}_l)^2} \quad \text{and} \quad a_l = \bar{y}_l - b_l \bar{x}_l.$$

This gives the least-squares line $y = a_l + b_l x$. The coefficients $a_l$ and $b_l$ are not
affected if we change the order of the $z_i$; they depend only on the bag $B$.
If we observe a bag $B = \lbag z_1, \dots, z_{n-1} \rbag$ of examples of the form $z_i = (x_i, y_i)$
and also $x$ but not $y$ for a new example $z = (x, y)$, then the least-squares
prediction of $y$ is

$$\hat{y} = a_{n-1} + b_{n-1} x. \tag{14}$$

We can use the error in this prediction as a nonconformity measure:

$$A(B, z) := |y - \hat{y}| = |y - (a_{n-1} + b_{n-1} x)|.$$

We can obtain other nonconformity measures by using other methods to estimate
a line.

Alternatively, we can include the new example as one of the examples used
to estimate the least squares line or some other regression line. In this case, it
is natural to write $(x_n, y_n)$ for the new example. Then $a_n$ and $b_n$ designate the
coefficients calculated from all $n$ examples, and we can use

$$|y_i - (a_n + b_n x_i)| \tag{15}$$

to measure the nonconformity of each of the $(x_i, y_i)$ with the others. In general,
the inclusion of the new example simplifies the implementation or at least the
explanation of the conformal algorithm. In the case of least squares, it does not
change the prediction regions.
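
The regression-based measure built on (14) can be sketched in a few lines. The function names below are hypothetical, and the code assumes the $x_i$ in the bag are not all equal, since otherwise the slope is undefined.

```python
def least_squares_line(bag):
    """Return (a, b) for the least-squares line y = a + b*x fitted to (x, y) pairs."""
    l = len(bag)
    x_bar = sum(x for x, _ in bag) / l
    y_bar = sum(y for _, y in bag) / l
    sxx = sum((x - x_bar) ** 2 for x, _ in bag)
    if sxx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b = sum((x - x_bar) * y for x, y in bag) / sxx
    a = y_bar - b * x_bar
    return a, b

def regression_nonconformity(bag, z):
    """Absolute error |y - (a + b*x)| of the line fitted to the bag, evaluated at z."""
    x, y = z
    a, b = least_squares_line(bag)
    return abs(y - (a + b * x))

# Hypothetical bag lying exactly on the line y = 1 + 2x.
bag = [(0, 1), (1, 3), (2, 5)]
a, b = least_squares_line(bag)
score = regression_nonconformity(bag, (3, 10))   # prediction is 7, so the error is 3
```

Because the fit depends on the pairs only through sums, reordering the bag leaves `a` and `b` unchanged, as a nonconformity measure requires.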

4.2 Conformal prediction from old examples alone 

Suppose we have chosen a nonconformity measure $A$ for our problem. Given $A$,
and given the assumption that the $z_i$ are exchangeable, we now define a valid
prediction region

$$\gamma^\epsilon(z_1, \dots, z_{n-1}) \subseteq \mathbf{Z},$$

where $\mathbf{Z}$ is the example space. We do this by giving an algorithm for deciding,
for each $z \in \mathbf{Z}$, whether $z$ should be included in the region. For simplicity in
stating this algorithm, we provisionally use the symbol $z_n$ for $z$, as if we were
assuming that $z_n$ is in fact equal to $z$.

The Conformal Algorithm Using Old Examples Alone

Input: Nonconformity measure $A$, significance level $\epsilon$, examples $z_1, \dots, z_{n-1}$,
example $z$

Task: Decide whether to include $z$ in $\gamma^\epsilon(z_1, \dots, z_{n-1})$.

Algorithm:

1. Provisionally set $z_n := z$.

2. For $i = 1, \dots, n$, set $\alpha_i := A(\lbag z_1, \dots, z_n \rbag \setminus \lbag z_i \rbag, z_i)$.

3. Set $p_z := \dfrac{\text{number of } i \text{ such that } 1 \le i \le n \text{ and } \alpha_i \ge \alpha_n}{n}$.

4. Include $z$ in $\gamma^\epsilon(z_1, \dots, z_{n-1})$ if and only if $p_z > \epsilon$.

If $\mathbf{Z}$ has only a few elements, this algorithm can be implemented in a brute-force
way: calculate $p_z$ for every $z \in \mathbf{Z}$. If $\mathbf{Z}$ has many elements, we will need some
other way of identifying the $z$ satisfying $p_z > \epsilon$.

The number $p_z$ is the fraction of the examples in $\lbag z_1, \dots, z_{n-1}, z \rbag$ that are
at least as different from the others as $z$ is, in the sense measured by $A$. So the
algorithm tells us to form a prediction region consisting of the $z$ that are not
among the fraction $\epsilon$ most out of place when they are added to the bag of old
examples.
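
The four steps translate almost line for line into code. The following is a minimal sketch, assuming the nonconformity measure is passed in as a function `A(bag, z)` with the bag represented as a list; the names `conformal_p_value`, `in_region`, and `distance_to_average` are hypothetical, and the data are made up for illustration.

```python
def conformal_p_value(A, old, z):
    """Steps 1-3: p-value of the candidate example z given the old examples."""
    examples = list(old) + [z]                      # step 1: provisionally z_n := z
    n = len(examples)
    scores = [A(examples[:i] + examples[i + 1:], examples[i])   # step 2: leave one out
              for i in range(n)]
    return sum(1 for a in scores if a >= scores[-1]) / n        # step 3

def in_region(A, old, z, eps):
    """Step 4: include z if and only if its p-value exceeds eps."""
    return conformal_p_value(A, old, z) > eps

def distance_to_average(bag, z):
    """Nonconformity measure (11): distance from z to the average of the bag."""
    return abs(sum(bag) / len(bag) - z)

old = [1, 2, 3, 4]                                   # hypothetical old examples
p_far = conformal_p_value(distance_to_average, old, 10)
p_near = conformal_p_value(distance_to_average, old, 3)
```

Here the candidate 10 is the most out-of-place of the five numbers, so its p-value is 1/5, while the candidate 3 ties for least out-of-place and gets p-value 1. At $\epsilon = 0.2$ the first is excluded and the second included.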

The definition of $\gamma^\epsilon(z_1, \dots, z_{n-1})$ can be framed as an application of the
widely accepted Neyman-Pearson theory for hypothesis testing and confidence
intervals [17]. In the Neyman-Pearson theory, we test a hypothesis $H$ using
a random variable $T$ that is likely to be large if $H$ is false. Once we observe
$T = t$, we calculate $p_H := \Pr(T \ge t \mid H)$. We reject $H$ at level $\epsilon$ if $p_H \le \epsilon$.
Because this happens under $H$ with probability no more than $\epsilon$, we can declare
$1 - \epsilon$ confidence that the true hypothesis $H$ is among those not rejected. Our
procedure makes these choices of $H$ and $T$:

• The hypothesis $H$ says the bag of the first $n$ examples is $\lbag z_1, \dots, z_{n-1}, z \rbag$.

• The test statistic $T$ is the random value of $\alpha_n$.

Under $H$, i.e., conditional on the bag $\lbag z_1, \dots, z_{n-1}, z \rbag$, $T$ is equally likely to
come out equal to any of the $\alpha_i$. Its observed value is $\alpha_n$. So

$$p_H = \Pr(T \ge \alpha_n \mid \lbag z_1, \dots, z_{n-1}, z \rbag) = p_z.$$

Since $z_1, \dots, z_{n-1}$ are known, rejecting the bag $\lbag z_1, \dots, z_{n-1}, z \rbag$ means rejecting
$z_n = z$. So our $1 - \epsilon$ confidence is in the set of $z$ for which $p_z > \epsilon$.

The regions $\gamma^\epsilon(z_1, \dots, z_{n-1})$ for successive $n$ are based on overlapping
observations rather than independent observations. But the successive errors
are $\epsilon$-rare $n$-events. The event that our $n$th prediction is an error, $z_n \notin
\gamma^\epsilon(z_1, \dots, z_{n-1})$, is the event $p_{z_n} \le \epsilon$. This is an $n$-event, because the value
of $p_{z_n}$ is determined by $z_n$ and the bag $\lbag z_1, \dots, z_{n-1} \rbag$. It is $\epsilon$-rare because it
is the event that $\alpha_n$ is among a fraction $\epsilon$ or fewer of the $\alpha_i$ that are strictly
larger than all the other $\alpha_i$, and this can have probability at most $\epsilon$ when the
$\alpha_i$ are exchangeable. So it follows from Informal Proposition 1 (§3.4) that we
can expect at least $1 - \epsilon$ of the $\gamma^\epsilon(z_1, \dots, z_{n-1})$, $n = 1, \dots, N$, to be correct.
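
This long-run claim can be probed with a short simulation. The sketch below is an illustration only: it draws hypothetical independent Gaussian examples (a special case of exchangeability), uses the nonconformity measure (12), and counts how often $p_{z_n} \le \epsilon$ over an on-line run. The names `p_value` and `online_error_rate`, the seed, and the sample size are all assumptions.

```python
import random

def p_value(old, z):
    """Conformal p-value of z using measure (12): distance to the average of all n examples."""
    examples = old + [z]
    mean = sum(examples) / len(examples)
    scores = [abs(mean - zi) for zi in examples]
    return sum(s >= scores[-1] for s in scores) / len(examples)

def online_error_rate(zs, eps):
    """Fraction of steps n at which z_n falls outside the (1 - eps) region built from z_1..z_{n-1}."""
    errors = 0
    for n in range(1, len(zs) + 1):
        if p_value(zs[:n - 1], zs[n - 1]) <= eps:    # error: z_n is not in the region
            errors += 1
    return errors / len(zs)

rng = random.Random(0)
zs = [rng.gauss(0.0, 1.0) for _ in range(1000)]      # hypothetical exchangeable data
rate = online_error_rate(zs, eps=0.1)
```

For a run of this length the observed error frequency should sit near or somewhat below 0.1, even though each region is computed from the same accumulating dataset.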



4.2.1 Example: Predicting a number with an average

In §2.1 we discussed Fisher's 95% prediction interval for $z_n$ based on
$z_1, \dots, z_{n-1}$, which is valid under the assumption that the $z_i$ are independent
and normally distributed. We used it to predict $z_{20}$ when the first 19 $z_i$
are

17, 20, 10, 17, 12, 15, 19, 22, 17, 19, 14, 22, 18, 17, 13, 12, 18, 15, 17.

Taking into account our knowledge that the $z_i$ are all integers, we arrived at the
95% prediction that $z_{20}$ is an integer between 10 and 23, inclusive.
What can we predict about $z_{20}$ at the 95% level if we drop the assumption
of normality and assume only exchangeability? To produce a 95% prediction
interval valid under the exchangeability assumption alone, we reason as follows.
To decide whether to include a particular value $z$ in the interval, we consider
twenty numbers that depend on $z$:

• First, the deviation of $z$ from the average of it and the other 19 numbers.
Because the sum of the 19 is 314, this is

$$\left| \frac{314 + z}{20} - z \right| = \frac{1}{20} |314 - 19z|. \tag{16}$$

• Then, for $i = 1, \dots, 19$, the deviation of $z_i$ from this same average. This
is

$$\left| \frac{314 + z}{20} - z_i \right| = \frac{1}{20} |314 + z - 20 z_i|. \tag{17}$$

Under the hypothesis that $z$ is the actual value of $z_n$, these 20 numbers are
exchangeable. Each of them is as likely as the other to be the largest. So there
is at least a 95% (19 in 20) chance that (16) will not exceed the largest of the
19 numbers in (17). The largest of the 19 $z_i$s being 22 and the smallest 10, we
can write this condition as

$$|314 - 19z| \le \max\{|314 + z - (20 \times 22)|,\ |314 + z - (20 \times 10)|\}, \tag{18}$$

which reduces to

$$10 \le z \le \frac{214}{9} \approx 23.8.$$



Taking into account that $z_{20}$ is an integer, our 95% prediction is that it will be
an integer between 10 and 23, inclusive. This is exactly the same prediction we
obtained by Fisher's method. We have lost nothing by weakening the assumption
that the $z_i$ are independent and normally distributed to the assumption
that they are exchangeable. But we are still basing our prediction region on
the average of old examples, which is an optimal estimator in various respects
under the assumption of normality.
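
The interval can also be recovered by brute force, running the conformal algorithm over candidate integers. This is a sketch for checking the arithmetic; the function name `p_value` and the candidate range 0 to 40 are choices made here, not part of the original calculation.

```python
from fractions import Fraction

old = [17, 20, 10, 17, 12, 15, 19, 22, 17, 19, 14, 22, 18, 17, 13, 12, 18, 15, 17]

def p_value(old, z):
    """Conformal p-value of z under measure (12): distance to the average of all 20 numbers."""
    examples = old + [z]
    n = len(examples)
    mean = Fraction(sum(examples), n)
    scores = [abs(mean - zi) for zi in examples]
    return Fraction(sum(s >= scores[-1] for s in scores), n)

eps = Fraction(1, 20)                      # 95% level
region = [z for z in range(0, 41) if p_value(old, z) > eps]
```

With $n = 20$ and $\epsilon = 0.05$, a candidate is kept exactly when at least one old example is as far from the common average as the candidate is, which is condition (18). The resulting `region` is the integers 10 through 23.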



4.2.2 Are we complicating the story unnecessarily?

The reader may feel that we are vacillating about whether to include the new
example in the bag with which we are comparing it. In our statement of the
conformal algorithm, we define the nonconformity scores by

$$\alpha_i := A(\lbag z_1, \dots, z_n \rbag \setminus \lbag z_i \rbag, z_i), \tag{19}$$

apparently signaling that we do not want to include $z_i$ in the bag to which it is
compared. But then we use the nonconformity measure

$$A(B, z) := |(\text{average of } z \text{ and all the examples in } B) - z|,$$

which seems to put $z$ back in the bag, reducing (19) to

$$\alpha_i = \left| \frac{\sum_{j=1}^{n} z_j}{n} - z_i \right|.$$

We could have reached this point more easily by writing

$$\alpha_i := A(\lbag z_1, \dots, z_n \rbag, z_i) \tag{20}$$

in the conformal algorithm and using $A(B, z) := |\bar{z}_B - z|$.

The two ways of defining nonconformity scores, (19) and (20), are equivalent,
inasmuch as whatever we can get with one of them we can get from the other
by changing the nonconformity measure. In this case, (20) might be more
convenient. But we will see other cases where (19) is more convenient. We also
have another reason for using (19). It is the form that generalizes, as we will
see in §5, to on-line compression models.

4.3 Conformal prediction using a new object 

Now we turn to the case where our example space $\mathbf{Z}$ is of the form $\mathbf{Z} = \mathbf{X} \times \mathbf{Y}$.
We call $\mathbf{X}$ the object space, $\mathbf{Y}$ the label space. We observe in sequence examples
$z_1, \dots, z_N$, where $z_i = (x_i, y_i)$. At the point where we have observed

$$z_1, \dots, z_{n-1}, x_n = (x_1, y_1), \dots, (x_{n-1}, y_{n-1}), x_n,$$

we want to predict $y_n$ by giving a prediction region

$$\Gamma^\epsilon(z_1, \dots, z_{n-1}, x_n) \subseteq \mathbf{Y}$$

that is valid at the $(1 - \epsilon)$ level. As in the special case where the $x_i$ are absent,
we start with a nonconformity measure $A(B, z)$.

We define the prediction region by giving an algorithm for deciding, for each
$y \in \mathbf{Y}$, whether $y$ should be included in the region. For simplicity in stating this
algorithm, we provisionally use the symbol $z_n$ for $(x_n, y)$, as if we were assuming
that $y_n$ is in fact equal to $y$.

The Conformal Algorithm

Input: Nonconformity measure $A$, significance level $\epsilon$, examples $z_1, \dots, z_{n-1}$,
object $x_n$, label $y$

Task: Decide whether to include $y$ in $\Gamma^\epsilon(z_1, \dots, z_{n-1}, x_n)$.

Algorithm:

1. Provisionally set $z_n := (x_n, y)$.

2. For $i = 1, \dots, n$, set $\alpha_i := A(\lbag z_1, \dots, z_n \rbag \setminus \lbag z_i \rbag, z_i)$.

3. Set $p_y := \dfrac{\#\{i = 1, \dots, n \mid \alpha_i \ge \alpha_n\}}{n}$.

4. Include $y$ in $\Gamma^\epsilon(z_1, \dots, z_{n-1}, x_n)$ if and only if $p_y > \epsilon$.

This differs only slightly from the conformal algorithm using old examples alone
(§4.2). Now we write $p_y$ instead of $p_z$, and we say that we are including $y$ in
$\Gamma^\epsilon(z_1, \dots, z_{n-1}, x_n)$ instead of saying that we are including $z$ in $\gamma^\epsilon(z_1, \dots, z_{n-1})$.

To see that this algorithm produces valid prediction regions, it suffices to see
that it consists of the algorithm for old examples alone together with a further
step that does not change the frequency of hits. We know that the region the
old algorithm produces,

$$\gamma^\epsilon(z_1, \dots, z_{n-1}) \subseteq \mathbf{Z}, \tag{21}$$

contains the new example $z_n = (x_n, y_n)$ at least $1 - \epsilon$ of the time. Once we know
$x_n$, we can rule out all $z = (x, y)$ in (21) with $x \ne x_n$. The $y$ not ruled out,
those such that $(x_n, y)$ is in (21), are precisely those in the set

$$\Gamma^\epsilon(z_1, \dots, z_{n-1}, x_n) \subseteq \mathbf{Y} \tag{22}$$

produced by our new algorithm. Having $(x_n, y_n)$ in (21) $1 - \epsilon$ of the time is
equivalent to having $y_n$ in (22) $1 - \epsilon$ of the time.
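
For a finite label space the algorithm can be run once per candidate label. The sketch below is illustrative: `label_p_values`, `prediction_region`, and `nn_score` are hypothetical names, the data are made up, and the nonconformity measure is the nearest-neighbor ratio (13) with infinity returned when only the denominator is zero or no same-label example exists (an assumption made here).

```python
import math

def nn_score(bag, z):
    """Nearest-neighbor nonconformity (13) for z = (x, y) against a bag of (x_i, y_i)."""
    x, y = z
    same = min((abs(xi - x) for xi, yi in bag if yi == y), default=math.inf)
    diff = min((abs(xi - x) for xi, yi in bag if yi != y), default=math.inf)
    if same == 0 and diff == 0:
        return 0.0
    if diff == 0 or same == math.inf:
        return math.inf
    return same / diff

def label_p_values(A, old, x_new, labels):
    """Run steps 1-3 of the conformal algorithm once for each candidate label."""
    p = {}
    for y in labels:
        examples = list(old) + [(x_new, y)]            # provisionally z_n := (x_n, y)
        n = len(examples)
        scores = [A(examples[:i] + examples[i + 1:], examples[i]) for i in range(n)]
        p[y] = sum(s >= scores[-1] for s in scores) / n
    return p

def prediction_region(p, eps):
    """Step 4: keep the labels whose p-value exceeds eps."""
    return {y for y, py in p.items() if py > eps}

old = [(1, "s"), (2, "s"), (6, "v"), (7, "v")]         # hypothetical labeled examples
p = label_p_values(nn_score, old, 8, ["s", "v"])
```

For the new object at 8, the label `v` gets p-value 1 and the label `s` gets p-value 0.2, so `prediction_region(p, 0.2)` is the single label `v` while `prediction_region(p, 0.1)` contains both labels.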

4.3.1 Example: Classifying iris flowers 

In 1936 jTJ|, R. A. Fisher used discriminant analysis to distinguish different 
species of iris on the basis of measurements of their flowers. The data he used 
included measurements by Edgar Anderson of flowers from 50 plants each of two 
species, iris setosa and iris versicolor. Two of the measurements, sepal length 
and petal width, are plotted in Figure 4.

To illustrate how the conformal algorithm can be used for classification, we 
have randomly chosen 25 of the 100 plants. The sepal lengths and species for 
the first 24 of them are listed in Table 2 and plotted in Figure 5. The 25th
plant in the sample has sepal length 6.8. On the basis of this information, 
would you classify it as setosa or versicolor, and how confident would you be 
in the classification? Because 6.8 is the longest sepal in the sample, nearly any 
reasonable method will classify the plant as versicolor, and this is in fact the 
correct answer. But the appropriate level of confidence is not so obvious. 

We calculate conformal prediction regions using three different nonconfor- 
mity measures: one based on distance to the nearest neighbors, one based on 
distance to the species average, and one based on a support-vector machine. Be- 
cause our evidence is relatively weak, we do not achieve the high precision with 
high confidence that can be achieved in many applications of machine learning 
(see, e.g., §4.5). But we get a clear view of the details of the calculations and
the interpretation of the results. 

Distance to the nearest neighbor belonging to each species. Here we
use the nonconformity measure (13). The fourth and fifth columns of Table 2
(labeled NN for nearest neighbor) give nonconformity scores $\alpha_i$ obtained with
$y_{25} = s$ and $y_{25} = v$, respectively. In both cases, these scores are given by

$$\alpha_i = A(\lbag z_1, \dots, z_{25} \rbag \setminus \lbag z_i \rbag, z_i) = \frac{\min\{|x_j - x_i| : 1 \le j \le 25,\ j \ne i,\ y_j = y_i\}}{\min\{|x_j - x_i| : 1 \le j \le 25,\ j \ne i,\ y_j \ne y_i\}}, \tag{23}$$

but for the fourth column $z_{25} = (6.8, s)$, while for the fifth column $z_{25} = (6.8, v)$.

If both the numerator and the denominator in (23) are equal to zero, we
take the ratio also to be zero. This happens in the case of the first plant, for
example. It has the same sepal length, 5.0, as the 7th and 13th plants, which
are setosa, and the 15th plant, which is versicolor.

Figure 4: Sepal length, petal width, and species for Edgar Anderson's
100 flowers. The 50 iris setosa are clustered at the lower left, while the 50 iris
versicolor are clustered at the upper right. The numbers indicate how many
plants have exactly the same measurement; for example, there are 5 plants that
have sepals 5 inches long and petals 0.2 inches wide. Petal width separates the
two species perfectly; all 50 versicolor petals are 1 inch wide or wider, while all
setosa petals are narrower than 1 inch. But there is substantial overlap in sepal
length.

Figure 5: Sepal length and species for the first 24 plants in our random
sample of size 25. Except for one versicolor with sepal length 5.0, the versicolor
in this sample all have longer sepals than the setosa. This high degree of
separation is an accident of the sampling.

[Table 2 body: per-plant rows not reproduced; p-value rows only.]

         NN      Species average   SVM
p_s      0.08    0.04              0.08
p_v      0.32    0.08              1

Table 2: Conformal prediction of iris species from sepal length, using
three different nonconformity measures. The data used are sepal length
and species for a random sample of 25 of the 100 plants measured by Edgar
Anderson. The second column gives $x_i$, the sepal length. The third column
gives $y_i$, the species. The 25th plant has sepal length $x_{25} = 6.8$, and our task
is to predict its species $y_{25}$. For each nonconformity measure, we calculate
nonconformity scores under each hypothesis, $y_{25} = s$ and $y_{25} = v$. The p-value
in each column is computed from the 25 nonconformity scores in that column;
it is the fraction of them equal to or larger than the 25th. The results from the
three nonconformity measures are consistent, inasmuch as the p-value for $v$ is
always larger than the p-value for $s$.

Step 3 of the conformal algorithm yields $p_s = 0.08$ and $p_v = 0.32$. Step 4
tells us that

• $s$ is in the $1 - \epsilon$ prediction region when $1 - \epsilon > 0.92$, and

• $v$ is in the $1 - \epsilon$ prediction region when $1 - \epsilon > 0.68$.

Here are prediction regions for a few levels of $\epsilon$.

• $\Gamma^{0.08} = \{v\}$. With 92% confidence, we predict that $y_{25} = v$.

• $\Gamma^{0.05} = \{s, v\}$. If we raise the confidence with which we want to predict
$y_{25}$ to 95%, the prediction is completely uninformative.

• $\Gamma^{1/3} = \emptyset$. If we lower the confidence to 2/3, we get a prediction we know
is false: $y_{25}$ will be in the empty set.

In fact, $y_{25} = v$. Our 92% prediction is correct.

The fact that we are making a known-to-be-false prediction with 2/3 confidence
is a signal that the 25th sepal length, 6.8, is unusual for either species.
A close look at the nonconformity scores reveals that it is being perceived as
unusual simply because 2/3 of the plants have other plants in the sample with
exactly the same sepal length, whereas there is no other plant with the sepal
length 6.8.

In classification problems, it is natural to report the greatest $1 - \epsilon$ for which
$\Gamma^\epsilon$ is a single label. In our example, this produces the statement that we are
92% confident that $y_{25}$ is $v$. But in order to avoid overconfidence when the
object $x_n$ is unusual, it is wise to report also the smallest $\epsilon$ for which $\Gamma^\epsilon$ is empty.
We call this the credibility of the prediction ([28], p. 96). In our example, the
prediction that $y_{25}$ will be $v$ has credibility of only 32%.*
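
The step from p-values to a reported prediction, confidence, and credibility is mechanical. A minimal sketch, assuming at least two labels and using hypothetical function names: the predicted label is the one with the largest p-value, the confidence is one minus the second-largest p-value, and the credibility is the largest p-value. The inputs are the p-values 0.08 and 0.32 obtained above, written as exact fractions.

```python
from fractions import Fraction

def region(p_values, eps):
    """Prediction region at significance level eps: labels with p-value above eps."""
    return {y for y, p in p_values.items() if p > eps}

def summarize(p_values):
    """Return (predicted label, confidence, credibility); needs at least two labels."""
    ranked = sorted(p_values.items(), key=lambda item: item[1], reverse=True)
    (best, p_best), (_, p_second) = ranked[0], ranked[1]
    return best, 1 - p_second, p_best

p = {"s": Fraction(2, 25), "v": Fraction(8, 25)}   # p_s = 0.08, p_v = 0.32
label, confidence, credibility = summarize(p)
```

This reproduces the statement in the text: the prediction is `v`, with confidence 23/25 = 92% and credibility 8/25 = 32%. The regions at $\epsilon = 0.08$, $0.05$, and $1/3$ come out as the single label `v`, both labels, and the empty set respectively.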

Distance to the average of each species. The nearest-neighbor nonconfor- 
mity measure, because it considers only nearby sepal lengths, does not take full 
advantage of the fact that a versicolor flower typically has longer sepals than 
a setosa flower. We can expect to obtain a more efficient conformal predictor 
(one that produces smaller regions for a given level of confidence) if we use a 
nonconformity measure that takes account of average sepal length for the two 
species. 

*This notion of credibility is one of the novelties of the theory of conformal prediction. It
is not found in the prior literature on confidence and prediction regions.

We use the nonconformity measure $A$ defined by

$$A(B, (x, y)) = |\bar{x}_{B \cup \lbag (x, y) \rbag,\, y} - x|, \tag{24}$$

where $\bar{x}_{B,y}$ denotes the average sepal length of all plants of species $y$ in the bag
$B$, and $B \cup \lbag z \rbag$ denotes the bag obtained by adding $z$ to $B$. To test $y_{25} = s$, we
consider the bag consisting of the 24 old examples together with $(6.8, s)$, and
we calculate the average sepal lengths for the two species in this bag: 5.06 for
setosa and 6.02 for versicolor. Then we use (24) to calculate the nonconformity
scores shown in the sixth column of Table 2:

$$\alpha_i = \begin{cases} |5.06 - x_i| & \text{if } y_i = s \\ |6.02 - x_i| & \text{if } y_i = v \end{cases}$$

for $i = 1, \dots, 25$, where we take $y_{25}$ to be $s$. To test $y_{25} = v$, we consider the
bag consisting of the 24 old examples together with $(6.8, v)$, and we calculate
the average sepal lengths for the two species in this bag: 4.94 for setosa and 6.1
for versicolor. Then we use (24) to calculate the nonconformity scores shown in
the seventh column of Table 2:

$$\alpha_i = \begin{cases} |4.94 - x_i| & \text{if } y_i = s \\ |6.1 - x_i| & \text{if } y_i = v \end{cases}$$

for $i = 1, \dots, 25$, where we take $y_{25}$ to be $v$.
We obtain $p_s = 0.04$ and $p_v = 0.08$, so that

• $s$ is in the $1 - \epsilon$ prediction region when $1 - \epsilon > 0.96$, and

• $v$ is in the $1 - \epsilon$ prediction region when $1 - \epsilon > 0.92$.

Here are the prediction regions for some different levels of $\epsilon$.

• $\Gamma^{0.04} = \{v\}$. With 96% confidence, we predict that $y_{25} = v$.

• $\Gamma^{0.03} = \{s, v\}$. If we raise the confidence with which we want to predict
$y_{25}$ to 97%, the prediction is completely uninformative.

• $\Gamma^{0.08} = \emptyset$. If we lower the confidence to 92%, we get a prediction we know
is false: $y_{25}$ will be in the empty set.

In this case, we predict $y_{25} = v$ with confidence 96% but credibility only 8%.
The credibility is lower with this nonconformity measure because it perceives
6.8 as being even more unusual than the nearest-neighbor measure did. It is
unusually far from the average sepal lengths for both species.
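
The same effect can be reproduced on a toy dataset. The sketch below implements measure (24) and computes both p-values for a new object that lies far from either species average. The data are hypothetical (they are not the iris sample), and the function names are assumptions made for illustration.

```python
def species_average_score(bag, z):
    """Measure (24): distance from x to the average of its species in the bag with z added."""
    x, y = z
    same = [xi for xi, yi in bag if yi == y] + [x]    # species y within B plus z itself
    return abs(sum(same) / len(same) - x)

def p_value(old, x_new, y):
    """Conformal p-value for the hypothesis that the new object has label y."""
    examples = list(old) + [(x_new, y)]
    n = len(examples)
    scores = [species_average_score(examples[:i] + examples[i + 1:], examples[i])
              for i in range(n)]
    return sum(s >= scores[-1] for s in scores) / n

old = [(4, "s"), (5, "s"), (6, "v"), (7, "v")]        # hypothetical measurements
p_s = p_value(old, 9, "s")
p_v = p_value(old, 9, "v")
```

Here the new object at 9 has the largest score under either label, so both p-values equal 1/5, the smallest value possible with five examples. As in the iris example, an object that is unusual for every label shows up as low credibility rather than as a confident prediction.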



A support-vector machine. As Vladimir Vapnik explains on pp. 408-410 of
his Statistical Learning Theory [27], support-vector machines grew out of the
idea of separating two groups of examples with a hyperplane in a way that
makes as few mistakes as possible, i.e., puts as few examples as possible on
the wrong side. This idea springs to mind when we look at Figure 5. In this
one-dimensional picture, a hyperplane is a point. We are tempted to separate
the setosa from the versicolor with a point between 5.5 and 5.7.

Figure 6: Separation for three bags. In each case, the separating band is
the interval [5.5, 5.7]. Examples on the wrong side of the interval are considered
strange and are circled.

Vapnik proposed to separate two groups not with a single hyperplane but
with a band: two hyperplanes with few or no examples between them that
separate the two groups as well as possible. Examples on the wrong side of both
hyperplanes would be considered very strange; those between the hyperplanes
would also be considered strange but less so. In our one-dimensional example,
the obvious separating band is the interval from 5.5 to 5.7. The only strange
example is the versicolor with sepal length 5.0.

Here is one way of making Vapnik's idea into an algorithm for calculating 
nonconformity scores for all the examples in a bag l{xi,yi), . . . (a;„, ?/„)J. First 
plot all the examples as in Figure O Then find numbers a and b such that a < b 
and the interval [a,b] separates the two groups with the fewest mistakes — i.e., 
minimizef0 

< i < n,Xi < b, and J/i = v} + < i < n, Xi > a, and yi — s}. 

There may be many intervals that minimize this count; choose one that is widest. 
Then give the ith example the score 



When applied to the bags in Figure [6l this algorithm gives the circled examples 
the score oo and all the others the score 0. These scores are listed in the last 
two columns of Table [S] 

As we see from the table, the resulting p- values are Ps — 0.08 and — 1. 
So this time we obtain 92% confidence in 2/25 = v, with 100% credibility. 

The algorithm just described is too complex to implement when there are 
thousands of examples. For this reason, Vapnik and his collaborators proposed 
instead a quadratic minimization that balances the width of the separating band 
against the number and size of the mistakes it makes. Support- vector machines 
of this type have been widely used. They usually solve the dual optimization 
problem, and the Lagrange multipliers they calculate can serve as nonconformity 
scores. Implementations sometimes fail to treat the old examples symmetrically 
because they make various uses of the order in which examples are presented, 
but this difficulty can be overcome by a preliminary randomization ([28|, p. 58). 

A systematic comparison. The random sample of 25 plants we have con- 
sidered is odd in two ways: (1) except for the one versicolor with sepal length 
of only 5.0, the two species do not overlap in sepal length, and (2) the flower 
whose species we are trying to predict has a sepal that is unusually long for 
either species. 

In order to get a fuller picture of how the three nonconformity measures 
perform in general on the iris data, we have applied each of them to 1,000 

*Here we are implicitly assuming that the setosa flowers will be on the left, with shorter 
sepal lengths. A general algorithm should also check the possibility of a separation with the 
versicolor flowers on the left. 




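$$\alpha_i = \begin{cases}
\infty & \text{if } y_i = v \text{ and } x_i < a, \text{ or } y_i = s \text{ and } b < x_i, \\
1 & \text{if } y_i = v \text{ and } a \le x_i < b, \text{ or } y_i = s \text{ and } a < x_i \le b, \\
0 & \text{if } y_i = v \text{ and } b \le x_i, \text{ or } y_i = s \text{ and } x_i \le a.
\end{cases}$$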
                      NN      Species average   SVM

singleton hits        164     441               195
uncertain             795     477               762
total hits            959     918               957
empty                 9       49                1
singleton errors      32      33                42
total errors          41      82                43
total examples        1000    1000              1000
% hits                96%     92%               96%

total singletons      196     474               237
% hits                84%     93%               82%

total errors          41      82                43
% empty               22%     60%               2%

Table 3: Performance of 92% prediction regions based on three nonconformity
measures. For each nonconformity measure, we have found 1,000
prediction regions at the 92% level, using each time a different random sample
of 25 from Anderson's 100 flowers. The "uncertain" regions are those equal to
the whole label space, $\mathbf{Y} = \{s, v\}$.

different samples of size 25 selected from the population of Anderson's 100 
plants. The results are shown in Table 3.

The 92% regions based on the species average were correct about 92% of the
time (918 times out of 1000), as advertised. The regions based on the other two
measures were correct more often, about 96% of the time. The reason for this
difference is visible in Table 2: the nonconformity scores based on the species
average take a greater variety of values and therefore produce ties less often.
The regions based on the species average are also more efficient (smaller); 441 of
its hits were informative, as opposed to fewer than 200 for each of the other two
nonconformity measures. This efficiency also shows up in more empty regions
among the errors. The species average produced an empty 92% prediction region
for the random sample used in Table 2, and Table 3 shows that this happens
5% of the time.

As a practical matter, the uncertain prediction regions ($\Gamma^{0.08} = \{s, v\}$) and
the empty ones ($\Gamma^{0.08} = \emptyset$) are equally uninformative. The only errors that
mislead are the singletons that are wrong, and the three methods all produce
these at about the same rate: 3 or 4%.
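The percentages in Table 3 and in the discussion above can be recomputed from the counts. The short Python sketch below does this; the variable names and the helper `pct` are illustrative choices, and only the counts themselves come from the table.

```python
# Counts copied from Table 3, in the order (NN, species average, SVM).
singleton_hits = [164, 441, 195]
total_hits = [959, 918, 957]
empty = [9, 49, 1]
singleton_errors = [32, 33, 42]
total_errors = [41, 82, 43]
n_trials = 1000

def pct(numerator, denominator):
    """Percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)

# Share of the 1,000 regions that contained the true label.
hit_rate = [pct(h, n_trials) for h in total_hits]

# Share of singleton regions that were correct.
singleton_hit_rate = [pct(h, h + e)
                      for h, e in zip(singleton_hits, singleton_errors)]

# Share of errors that were empty regions rather than wrong singletons.
empty_share_of_errors = [pct(e, t) for e, t in zip(empty, total_errors)]

# Share of all regions that were misleading (wrong singletons).
misleading_rate = [pct(e, n_trials) for e in singleton_errors]

print(hit_rate, singleton_hit_rate, empty_share_of_errors, misleading_rate)
```

The last list is the "3 or 4%" rate of misleading errors mentioned in the text.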

4.3.2 Example: Predicting petal width from sepal length 

We now turn to the use of the conformal algorithm to predict a number. We use 
the same 25 plants, but now we use the data in the second and third columns of
Table 4: the sepal length and petal width for the first 24 plants, and the sepal
length for the 25th. Our task is to predict the petal width for the 25th.

         sepal length   petal width   Nearest neighbor   Linear regression

z_1      5.0            0.3           0.3                |0.003 y_25 - 0.149|
z_2      4.4            0.2           0                  |0.069 y_25 + 0.050|
z_3      4.9            0.2           0.25               |0.014 y_25 - 0.199|
z_4      4.4            0.2           0                  |0.069 y_25 + 0.050|
z_5      5.1            0.4           0.15               |0.008 y_25 + 0.099|
z_6      5.9            1.5           0.3                |0.096 y_25 - 0.603|
z_7      5.0            0.2           0.4                |0.003 y_25 - 0.249|
z_8      6.4            1.3           0.2                |0.151 y_25 - 0.154|
z_9      6.7            1.4           0.3                |0.184 y_25 - 0.104|
z_10     6.2            1.5           0.2                |0.129 y_25 - 0.453|
z_11     5.1            0.2           0.15               |0.008 y_25 + 0.299|
z_12     4.6            0.2           0.05               |0.047 y_25 - 0.050|
z_13     5.0            0.6           0.3                |0.003 y_25 + 0.151|
z_14     5.4            0.4           0                  |0.041 y_25 + 0.248|
z_15     5.0            1.0           0.75               |0.003 y_25 + 0.551|
z_16     6.7            1.7           0.3                |0.184 y_25 - 0.404|
z_17     5.8            1.2           0.2                |0.085 y_25 - 0.353|
z_18     5.5            0.2           0.2                |0.052 y_25 + 0.498|
z_19     5.8            1.0           0.2                |0.085 y_25 - 0.153|
z_20     5.4            0.4           0                  |0.041 y_25 + 0.248|
z_21     5.1            0.3           0                  |0.008 y_25 + 0.199|
z_22     5.7            1.3           0.2                |0.074 y_25 - 0.502|
z_23     4.6            0.3           0.1                |0.047 y_25 + 0.050|
z_24     4.6            0.2           0.05               |0.047 y_25 - 0.050|
z_25     6.8            y_25          |y_25 - 1.55|      |0.805 y_25 - 1.345|

Table 4: Conformal prediction of petal width from sepal length. We
use the same random 25 plants that we used for predicting the species. The
actual value of $y_{25}$ is 1.4.

The most conventional way of analyzing this data is to calculate the least-
squares line:

$$y = a_{24} + b_{24} x = -2.96 + 0.68x.$$

The sepal length for the 25th plant being $x_{25} = 6.8$, the line predicts that $y_{25}$
should be near $-2.96 + 0.68 \times 6.8 = 1.66$. Under the textbook assumption that
the $y_i$ are all independent and normally distributed with means on the line and
a common variance, we estimate the common variance by

$$s_{24}^2 = \frac{1}{22} \sum_{i=1}^{24} \bigl( y_i - (a_{24} + b_{24} x_i) \bigr)^2 = 0.0780.$$

The textbook $1 - \epsilon$ interval for $y_{25}$ based on $\lbag (x_1, y_1), \ldots, (x_{24}, y_{24}) \rbag$ and $x_{25}$ is

$$1.66 \pm t_{22}^{\epsilon/2} \, s_{24} \sqrt{1 + \frac{1}{24} + \frac{(x_{25} - \bar{x}_{24})^2}{\sum_{j=1}^{24} (x_j - \bar{x}_{24})^2}} = 1.66 \pm 0.311 \, t_{22}^{\epsilon/2} \tag{25}$$

(01, p. 82; pp. 21-22; p. 145). Taking into account the fact that $y_{25}$ is
measured to only one decimal place, we obtain [1.0, 2.3] for the 96% interval
and [1.1, 2.2] for the 92% interval.

The prediction interval (25) is analogous to Fisher's interval for a new
example from the same normally distributed population as a bag of old examples
(§2.1.1). In §5.3.2 we will review the general model of which both are special
cases.

As we will now see, the conformal algorithm under exchangeability gives
confidence intervals comparable to (25), without the assumption that the errors
are normal. We use two different nonconformity measures: one based on the
nearest neighbor, and one based on the least-squares line.

Conformal prediction using the nearest neighbor. Suppose $B$ is a bag of
old examples and $(x, y)$ is a new example, for which we know the sepal length
$x$ but not the petal width $y$. We can predict $y$ using the nearest neighbor in
an obvious way: we find the $z' \in B$ for which the sepal length $x'$ is closest to
$x$, and we predict that $y$ will be the same as the petal width $y'$. If there are
several examples in the bag with sepal length equally close to $x$, then we take the
median of their petal widths as our predictor $\hat{y}$. The associated nonconformity
measure is $|\hat{y} - y|$.

The fourth column of Table 4 gives the nonconformity scores for our sample
using this nonconformity measure. We see that $\alpha_{25} = |y_{25} - 1.55|$. The other
nonconformity scores do not involve $y_{25}$; the largest is 0.75, and the second
largest is 0.40. So we obtain these prediction regions for $y_{25}$:

• The 96% prediction region consists of all the $y$ for which $p_y > 0.04$, which
requires that at least one of the other $\alpha_i$ be as large as $\alpha_{25}$, or that
$0.75 \ge |y - 1.55|$. This is the interval [0.8, 2.3].

• The 92% prediction region consists of all the $y$ for which $p_y > 0.08$, which
requires that at least two of the other $\alpha_i$ be as large as $\alpha_{25}$, or that
$0.40 \ge |y - 1.55|$. This is the interval [1.2, 1.9].
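The two regions can be read off mechanically from the old scores. The Python sketch below is one way to do it, assuming the nearest-neighbor scores of Table 4 (with blank cells read as 0) and the center 1.55; the function name `nn_region` is hypothetical. It returns unrounded endpoints, which the text then rounds inward to the one-decimal grid on which petal widths are measured.

```python
import math

# Nearest-neighbor scores of the 24 old examples, copied from Table 4
# (blank cells in the table are read as 0).
old_scores = [0.3, 0.0, 0.25, 0.0, 0.15, 0.3, 0.4, 0.2, 0.3, 0.2, 0.15, 0.05,
              0.3, 0.0, 0.75, 0.3, 0.2, 0.2, 0.2, 0.0, 0.0, 0.2, 0.1, 0.05]
center = 1.55  # median petal width of the new plant's nearest neighbors

def nn_region(old_scores, center, eps):
    """Interval of y whose p-value exceeds eps when alpha_n = |y - center|."""
    n = len(old_scores) + 1
    # p_y > eps holds exactly when at least k old scores are >= alpha_n.
    k = math.floor(eps * n + 1e-9)
    if k == 0:
        return (-math.inf, math.inf)   # every y is accepted
    if k > len(old_scores):
        return None                    # no y is accepted
    radius = sorted(old_scores, reverse=True)[k - 1]  # k-th largest old score
    return (center - radius, center + radius)

print(nn_region(old_scores, center, 0.04))  # 96% region
print(nn_region(old_scores, center, 0.08))  # 92% region
```

The unrounded 92% region runs from 1.15 to 1.95; restricting to values with one decimal place gives the interval [1.2, 1.9] stated above.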
Conformal prediction using least-squares. Now we use the least-squares
nonconformity measure with inclusion, given by (18). In our case, $n = 25$, so
our nonconformity scores are

$$\alpha_i = |y_i - (a_{25} + b_{25} x_i)| = \left| y_i - \frac{\sum_{j=1}^{25} y_j}{25} - \frac{\sum_{j=1}^{25} (x_j - \bar{x}_{25}) y_j}{\sum_{j=1}^{25} (x_j - \bar{x}_{25})^2} \left( x_i - \frac{\sum_{j=1}^{25} x_j}{25} \right) \right|.$$

When we substitute values of $\sum_{j=1}^{24} y_j$, $\sum_{j=1}^{24} (x_j - \bar{x}_{25}) y_j$, $\sum_{j=1}^{25} (x_j - \bar{x}_{25})^2$, and
$\sum_{j=1}^{25} x_j$ calculated from Table 4, this becomes

$$\alpha_i = |y_i + (0.553 - 0.110 x_i) y_{25} - 0.498 x_i + 2.04|.$$

For $i = 1, \ldots, 24$, we can further evaluate this by substituting the values of $x_i$
and $y_i$. For $i = 25$, we can substitute 6.8 for $x_{25}$. These substitutions produce
the expressions of the form $|c_i y_{25} + d_i|$ listed in the last column of Table 4.
We have made sure that $c_i$ is always positive by multiplying by $-1$ within the
absolute value when need be.
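The substitution step can be sketched in a few lines of Python. The sketch assumes the rounded constants 0.553, 0.110, 0.498 and 2.04 from the displayed formula, so its output agrees with the last column of Table 4 only up to rounding in the third decimal place; the function names are hypothetical.

```python
def ls_coefficients(x_i, y_i):
    """(c_i, d_i) with c_i >= 0 such that alpha_i = |c_i * y25 + d_i|, i <= 24."""
    c = 0.553 - 0.110 * x_i
    d = y_i - 0.498 * x_i + 2.04
    if c < 0:
        # Multiply by -1 inside the absolute value so that c_i is positive.
        c, d = -c, -d
    return c, d

def ls_coefficients_new(x_25):
    """(c_25, d_25): here y_i is y25 itself, which adds 1 to the coefficient."""
    c = 1 + 0.553 - 0.110 * x_25
    d = -0.498 * x_25 + 2.04
    return c, d

print(ls_coefficients(5.0, 0.3))   # z_1 in Table 4
print(ls_coefficients(5.9, 1.5))   # z_6 in Table 4 (sign flipped)
print(ls_coefficients_new(6.8))    # z_25 in Table 4
```

For example, for $z_6$ the raw coefficient $0.553 - 0.110 \times 5.9$ is negative, so both terms change sign, which is why the table shows $|0.096 y_{25} - 0.603|$.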

Table 5 shows the calculations required to find the conformal prediction region.
The task is to identify, for $i = 1, \ldots, 24$, the $y$ for which
$|c_i y + d_i| \ge |0.805y - 1.345|$. We first find the solutions of the equation
$|c_i y + d_i| = |0.805y - 1.345|$, which are

$$-\frac{d_i + 1.345}{c_i - 0.805} \quad \text{and} \quad -\frac{d_i - 1.345}{c_i + 0.805}.$$

As it happens, $c_i < 0.805$ for $i = 1, \ldots, 24$, and in this case the $y$ satisfying
$|c_i y + d_i| \ge |0.805y - 1.345|$ form the interval between these two points. This
interval is shown in the last column of the table.
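The two solutions come from the two ways in which the absolute values can agree. As a short worked derivation in the notation of the text:

```latex
\begin{align*}
c_i y + d_i = 0.805y - 1.345
  &\iff (c_i - 0.805)\,y = -(d_i + 1.345)
  \iff y = -\frac{d_i + 1.345}{c_i - 0.805}, \\
c_i y + d_i = -(0.805y - 1.345)
  &\iff (c_i + 0.805)\,y = -(d_i - 1.345)
  \iff y = -\frac{d_i - 1.345}{c_i + 0.805}.
\end{align*}
```

For $z_1$, with $c_1 = 0.003$ and $d_1 = -0.149$, these give $1.196 / 0.802 \approx 1.49$ and $1.494 / 0.808 \approx 1.85$, the first row of Table 5.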

In order to be in the 96% interval, y must be in at least one of the 24 intervals 
in the table; in order to be in the 92% interval, it must be in at least two of 
them. So the 96% interval is [1.0,2.4], and the 92% interval is [1.0,2.3]. 
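The counting argument can be checked with a short Python sketch. It assumes the coefficient pairs $(c_i, d_i)$ from the last column of Table 4 and relies on the fact that every one of the 24 intervals contains the point $1.345/0.805$, where the 25th score is zero; under that assumption the set covered by at least $k$ intervals runs from the $k$-th smallest lower endpoint to the $k$-th largest upper endpoint. The names `crossing_points` and `ls_region` are hypothetical.

```python
import math

# (c_i, d_i) for i = 1..24, copied from the last column of Table 4.
CD = [(0.003, -0.149), (0.069, 0.050), (0.014, -0.199), (0.069, 0.050),
      (0.008, 0.099), (0.096, -0.603), (0.003, -0.249), (0.151, -0.154),
      (0.184, -0.104), (0.129, -0.453), (0.008, 0.299), (0.047, -0.050),
      (0.003, 0.151), (0.041, 0.248), (0.003, 0.551), (0.184, -0.404),
      (0.085, -0.353), (0.052, 0.498), (0.085, -0.153), (0.041, 0.248),
      (0.008, 0.199), (0.074, -0.502), (0.047, 0.050), (0.047, -0.050)]

def crossing_points(c, d):
    """Endpoints of the interval where |c*y + d| >= |0.805*y - 1.345| (c < 0.805)."""
    r1 = -(d + 1.345) / (c - 0.805)
    r2 = -(d - 1.345) / (c + 0.805)
    return min(r1, r2), max(r1, r2)

def ls_region(pairs, eps):
    """Set of y lying in at least k of the intervals, k = floor(eps * n)."""
    n = len(pairs) + 1
    k = math.floor(eps * n + 1e-9)
    if k < 1:
        raise ValueError("eps is too small to exclude any value of y")
    intervals = [crossing_points(c, d) for c, d in pairs]
    lowers = sorted(lo for lo, _ in intervals)
    uppers = sorted((hi for _, hi in intervals), reverse=True)
    # Valid because all intervals share the point 1.345 / 0.805.
    return lowers[k - 1], uppers[k - 1]

print(ls_region(CD, 0.04))  # 96% region, before rounding to one decimal
print(ls_region(CD, 0.08))  # 92% region, before rounding to one decimal
```

The unrounded endpoints are roughly 0.98 and 2.45 at the 96% level and 0.99 and 2.36 at the 92% level; keeping only values with one decimal place gives the intervals stated above.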

An algorithm for finding conformal prediction intervals using a least-squares 
or ridge-regression nonconformity measure with an object space of any finite 
dimension is spelled out on pp. 32-33 of [28]. 



4.4 Optimality 

The predictions produced by the conformal algorithm are invariant with respect 
to the old examples, correct with the advertised probability, and nested. As we 
now show, they are optimal among all region predictors with these properties. 
Here is a more precise statement of the three properties:

1. The predictions are invariant with respect to the ordering of the old
examples. Formally, this means that the predictor $\gamma$ is a function of two
variables, the significance level $\epsilon$ and the bag $B$ of old examples. We write
$\gamma^\epsilon(B)$ for the prediction, which is a subset of the example space $\mathbf{Z}$.
                                       -(d_i + 1.345)   -(d_i - 1.345)   y satisfying
                                       --------------   --------------   |c_i y + d_i| >=
         alpha_i = |c_i y_25 + d_i|     c_i - 0.805      c_i + 0.805     |0.805 y - 1.345|

z_1      |0.003 y_25 - 0.149|          1.49             1.85             [1.49, 1.85]
z_2      |0.069 y_25 + 0.050|          1.90             1.48             [1.48, 1.90]
z_3      |0.014 y_25 - 0.199|          1.45             1.89             [1.45, 1.89]
z_4      |0.069 y_25 + 0.050|          1.90             1.48             [1.48, 1.90]
z_5      |0.008 y_25 + 0.099|          1.81             1.53             [1.53, 1.81]
z_6      |0.096 y_25 - 0.603|          1.05             2.16             [1.05, 2.16]
z_7      |0.003 y_25 - 0.249|          1.37             1.97             [1.37, 1.97]
z_8      |0.151 y_25 - 0.154|          1.82             1.57             [1.57, 1.82]
z_9      |0.184 y_25 - 0.104|          2.00             1.47             [1.47, 2.00]
z_10     |0.129 y_25 - 0.453|          1.32             1.93             [1.32, 1.93]
z_11     |0.008 y_25 + 0.299|          2.06             1.29             [1.29, 2.06]
z_12     |0.047 y_25 - 0.050|          1.71             1.64             [1.64, 1.71]
z_13     |0.003 y_25 + 0.151|          1.87             1.48             [1.48, 1.87]
z_14     |0.041 y_25 + 0.248|          2.09             1.30             [1.30, 2.09]
z_15     |0.003 y_25 + 0.551|          2.36             0.98             [0.98, 2.36]
z_16     |0.184 y_25 - 0.404|          1.52             1.77             [1.52, 1.77]
z_17     |0.085 y_25 - 0.353|          1.38             1.91             [1.38, 1.91]
z_18     |0.052 y_25 + 0.498|          2.45             0.99             [0.99, 2.45]
z_19     |0.085 y_25 - 0.153|          1.66             1.68             [1.66, 1.68]
z_20     |0.041 y_25 + 0.248|          2.09             1.30             [1.30, 2.09]
z_21     |0.008 y_25 + 0.199|          1.94             1.41             [1.41, 1.94]
z_22     |0.074 y_25 - 0.502|          1.15             2.10             [1.15, 2.10]
z_23     |0.047 y_25 + 0.050|          1.84             1.52             [1.52, 1.84]
z_24     |0.047 y_25 - 0.050|          1.71             1.64             [1.64, 1.71]
z_25     |0.805 y_25 - 1.345|

Table 5: Calculations with least-squares nonconformity scores. The
column on the right gives the values of $y$ for which the example's nonconformity
score will exceed that of the 25th example.

         Least-squares         Conformal prediction with two
         prediction with       different nonconformity measures
         normal errors         NN              Least squares

96%      [1.0, 2.3]            [0.8, 2.3]      [1.0, 2.4]
92%      [1.1, 2.2]            [1.2, 1.9]      [1.0, 2.3]

Table 6: Prediction intervals for the 25th plant's petal width, calculated
by three different methods. The conformal prediction intervals using
the least-squares nonconformity measure are quite close to the standard intervals
based on least-squares with normal errors. All the intervals contain the
actual value, 1.4.
2. The probability of a hit is always at least the advertised confidence level.
For every positive integer $n$ and every probability distribution under which
$z_1, \ldots, z_n$ are exchangeable,

$$\Pr\{ z_n \in \gamma^\epsilon(\lbag z_1, \ldots, z_{n-1} \rbag) \} \ge 1 - \epsilon.$$

3. The prediction regions are nested. If $\epsilon_1 \ge \epsilon_2$, then $\gamma^{\epsilon_1}(B) \subseteq \gamma^{\epsilon_2}(B)$.

Conformal predictors satisfy these three conditions. Other region predictors
can also satisfy them. But as we now demonstrate, any $\gamma$ satisfying them can
be improved on by a conformal predictor: there always exists a nonconformity
measure $A$ such that the predictor $\gamma_A$ constructed from $A$ by the conformal
algorithm satisfies $\gamma_A^\epsilon(B) \subseteq \gamma^\epsilon(B)$ for all $B$ and $\epsilon$.

The key to the demonstration is the following lemma: 

Lemma 1 Suppose $\gamma$ is a region predictor satisfying the three conditions,
$\lbag a_1, \ldots, a_n \rbag$ is a bag of examples, and $0 < \epsilon < 1$. Then $n\epsilon$ or fewer of the $n$
elements of the bag satisfy

$$a_i \notin \gamma^\epsilon(\lbag a_1, \ldots, a_n \rbag \setminus \lbag a_i \rbag). \tag{26}$$

Proof Consider the unique exchangeable probability distribution for $z_1, \ldots, z_n$
that gives probability 1 to $\lbag z_1, \ldots, z_n \rbag = \lbag a_1, \ldots, a_n \rbag$. Under this distribution,
each element of $\lbag a_1, \ldots, a_n \rbag$ has an equal probability of being $z_n$, and in this
case, (26) is a mistake. By the second condition, the probability of a mistake
is $\epsilon$ or less. So the fraction of the bag's elements for which (26) holds is $\epsilon$ or
less. ∎

Given the region predictor $\gamma$, what nonconformity measure will give us a
conformal predictor that improves on it? If

$$z \notin \gamma^\delta(B), \tag{27}$$

then $\gamma$ is asserting confidence $1 - \delta$ that $z$ should not appear next because it
is so different from $B$. So the largest $1 - \delta$ for which (27) holds is a natural
nonconformity measure:

$$A(B, z) = \sup\{ 1 - \delta \mid z \notin \gamma^\delta(B) \}.$$

The conformal predictor $\gamma_A$ obtained from this nonconformity measure, though
it agrees with $\gamma$ on how to rank different $z$ with respect to their nonconformity
with $B$, may produce tighter prediction regions if $\gamma$ is too conservative in the
levels of confidence it asserts.

To show that $\gamma_A^\epsilon(B) \subseteq \gamma^\epsilon(B)$ for every $\epsilon$ and every $B$, we assume that

$$z \in \gamma_A^\epsilon(\lbag z_1, \ldots, z_{n-1} \rbag) \tag{28}$$

and show that $z \in \gamma^\epsilon(\lbag z_1, \ldots, z_{n-1} \rbag)$. According to the conformal algorithm,
(28) means that when we provisionally set $z_n$ equal to $z$ and calculate the
nonconformity scores

$$\alpha_i = \sup\{ 1 - \delta \mid z_i \notin \gamma^\delta(\lbag z_1, \ldots, z_n \rbag \setminus \lbag z_i \rbag) \}$$

for $i = 1, \ldots, n$, we find that strictly more than $n\epsilon$ of these scores are greater
than or equal to $\alpha_n$. Because $\gamma$'s prediction regions are nested (condition 3),
it follows that if $z_n \notin \gamma^\epsilon(\lbag z_1, \ldots, z_{n-1} \rbag)$, then $z_i \notin \gamma^\epsilon(\lbag z_1, \ldots, z_n \rbag \setminus \lbag z_i \rbag)$ for
strictly more than $n\epsilon$ of the $z_i$. But by Lemma 1, $n\epsilon$ or fewer of the $z_i$ can
satisfy this condition. So $z_n \in \gamma^\epsilon(\lbag z_1, \ldots, z_{n-1} \rbag)$.

There are sensible reasons to use region predictors that are not invariant. 
We may want to exploit possible departures from exchangeability even while 
insisting on validity under exchangeability. Or it may simply be more practical 
to use a predictor that is not invariant. But invariance is a natural condition 
when we want to rely only on exchangeability, and in this case our optimality 
result is persuasive. For further discussion, see §2.4 of [28].

4.5 Examples are seldom exactly exchangeable. 

Although the assumption of exchangeability is weak compared to the assump- 
tions embodied in most statistical models, it is still an idealization, seldom 
matched exactly by what we see in the world. So we should not expect conclu- 
sions derived from this assumption to be exactly true. In particular, we should 
not be surprised if a 95% conformal predictor is wrong more than 5% of the 
time. 

We can make this point with the USPS dataset so often used to illustrate 
machine learning methods. This dataset consists of 9298 examples of the form 
$(x, y)$, where $x$ is a $16 \times 16$ gray-scale matrix and $y$ is one of the ten digits
0, 1, . . . , 9. It has been used in hundreds of books and articles. In [28], it is
used to illustrate conformal prediction with a number of different nonconformity 
measures. It is well known that the examples in this dataset are not perfectly 
exchangeable. In particular, the first 7291 examples, which are often treated as 
a training set, are systematically different in some respects from the remaining 
2007 examples, which are usually treated as a test set. 

Figure [7] illustrates how the non-exchangeability of the USPS data affects 
conformal prediction. The figure records the performance of the 95% conformal 
predictor using the nearest-neighbor nonconformity measure (13), applied to the
USPS data in two ways. First we use the 9298 examples in the order in which 
they are given in the dataset. (We ignore the distinction between training and 
test examples, but since the training examples are given first we do go through 
them first.) Working through the examples in this order, we predict each $y_n$
using the previous examples and $x_n$. Second, we randomly permute all 9298
examples, thus producing an order with respect to which the examples are 
necessarily exchangeable. The law of large numbers works when we go through 
the examples in the permuted order: we make mistakes at a steady rate, about 
equal to the expected 5%. But when we go through the examples in the original 
order, the fraction of mistakes is less stable, and it worsens as we move into the 
test set. As Table [7] shows, the fraction of mistakes is 5%, as desired, in the first 
7291 examples (the training set) but jumps to 8% in the last 2007 examples. 

Non-exchangeability can be tested statistically, using conventional or game-
theoretic methods (see §7.1 of [28]). In the case of this data, any reasonable test

Figure 7: Errors in 95% nearest-neighbor conformal prediction on the
classical USPS dataset. When the 9298 examples are predicted in a randomly
chosen order, so that the exchangeability assumption is satisfied for sure, the
error rate is approximately 5% as advertised. When they are taken in their
original order, first the 7291 in the training set, and then the 2007 in the test
set, the error rate is higher, especially in the test set.
                     Original data                  Permuted data
                     Training   Test     Total      Training   Test     Total

singleton hits       6798       1838     8636       6800       1905     8705
uncertain hits       111        0        111        123        0        123
total hits           6909       1838     8747       6923       1905     8828
empty                265        142      407        205        81       286
singleton errors     102        27       129        160        21       181
uncertain errors     15         0        15         3          0        3
total errors         382        169      551        368        102      470
total examples       7291       2007     9298       7291       2007     9298
% hits               95%        92%      94%        95%        95%      95%

total singletons     6900       1865     8765       6960       1926     8880
% hits               99%        99%      99%        98%        99%      98%

total uncertain      126        0        126        126        0        126
% hits               82%                 82%        98%                 98%

total errors         382        169      551        368        102      470
% empty              69%        85%      74%        57%        79%      61%

Table 7: Details of the performance of 95% nearest-neighbor conformal
prediction on the classical USPS dataset. Because there are 10 labels, the
uncertain predictions, those containing more than one label, can be hits or
errors.

will reject exchangeability decisively. Whether the deviation from exchangeabil- 
ity is of practical importance for prediction depends, of course, on circumstances. 
An error rate of 8% when 5% has been promised may or may not be acceptable. 
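The error rates quoted in this section follow directly from the counts in Table 7. A minimal Python check, in which the dictionary layout and names are illustrative and only the counts come from the table:

```python
# (total errors, total examples) copied from Table 7.
counts = {
    ("original", "training"): (382, 7291),
    ("original", "test"): (169, 2007),
    ("permuted", "training"): (368, 7291),
    ("permuted", "test"): (102, 2007),
}

# Fraction of predictions that were errors in each block of examples.
error_rate = {key: errors / total for key, (errors, total) in counts.items()}

for (order, block), rate in error_rate.items():
    print(f"{order:9s} {block:9s} {rate:.1%}")
```

In the original order the error rate rises from about 5% on the training block to about 8% on the test block, while in the permuted order both blocks stay close to the nominal 5%.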

5 On-line compression models 

In this section, we generalize conformal prediction from the exchangeability 
model to a whole class of models, which we call on-line compression models. 

In the exchangeability model, we compress or summarize examples by omit- 
ting information about their order. We then look backwards from the summary 
(the bag of unordered examples) and give probabilities for the different orderings 
that could have produced it. The compression can be done on-line: each time 
we see a new example, we add it to the bag. The backward-looking probabili- 
ties can also be given step by step. Other on-line compression models compress 
more or less drastically but have a similar structure. 

On-line compression models were studied in the 1970s and 1980s, under 
various names, by Per Martin-Löf [18], Steffen Lauritzen [16], and Eugene Asarin
[1, 2]. Different authors had different motivations. Lauritzen and Martin-Löf
started from statistical mechanics, whereas Asarin started from Kolmogorov's 
thinking about the meaning of randomness. But the models they studied all 
summarize past examples using statistics that contain all the information useful 
for predicting future examples. The summary is updated each time one observes 
a new example, and the probabilistic content of the structure is expressed by
Markov kernels that give probabilities for summarized examples conditional on 
the summaries. 

In general, a Markov kernel is a mapping that specifies, as a function of 
one variable, a probability distribution for some other variable or variables. A 
Markov kernel for w given u, for example, gives a probability distribution for 
$w$ for each value of $u$. It is conventional to write $P(w \mid u)$ for this distribution.
We are interested in Markov kernels of the form $P(z_1, \ldots, z_n \mid \sigma_n)$, where $\sigma_n$
summarizes the examples $z_1, \ldots, z_n$. Such a kernel gives probabilities for the
different $z_1, \ldots, z_n$ that could have produced $\sigma_n$.

Martin-Löf, Lauritzen, and Asarin were interested in justifying widely used
statistical models from principles that seem less arbitrary than the models
themselves. On-line compression models offer an opportunity to do this, because they
typically limit their use of probability to representing ignorance with a uniform
distribution but lead to statistical models that seem to say something more.
Suppose, for example, that Joe summarizes numbers $z_1, \ldots, z_n$ by

$$\bar{z} = \frac{1}{n} \sum_{i=1}^{n} z_i \quad \text{and} \quad r^2 = \sum_{i=1}^{n} (z_i - \bar{z})^2$$

and gives these summaries to Bill, who does not know $z_1, \ldots, z_n$. Bill might
adopt a probability distribution for $z_1, \ldots, z_n$ that is uniform over the
possibilities, which form the surface of the n-dimensional sphere of radius $r$ centered
around $(\bar{z}, \ldots, \bar{z})$. As we will see in §5.3.2, this is an on-line compression model.
It was shown, by Freedman and Smith ([28], p. 217) and then by Lauritzen ([16],
pp. 238-247), that if we assume this model is valid for all $n$, then the
distribution of $z_1, z_2, \ldots$ must be a mixture of distributions under which $z_1, z_2, \ldots$ are
independent and normal with a common mean and variance. This is analogous
to de Finetti's theorem, which says that if $z_1, \ldots, z_n$ are exchangeable for all
$n$, then the distribution of $z_1, z_2, \ldots$ must be a mixture of distributions under
which $z_1, z_2, \ldots$ are independent.

For our own part, we are interested in using an on-line compression model
directly for prediction rather than as a step towards a model that specifies
probabilities for examples more fully. We have already seen how the exchangeability
model can be used directly for prediction: we establish a law of large numbers
for backward-looking probabilities (§3.4), and we use it to justify confidence in
conformal prediction regions (§4.2). The argument extends to on-line compression
models in general.

For the exchangeability model, conformal prediction is optimal for obtaining
prediction regions (§4.4). No such statement can be made for on-line compression
models in general. In fact, there are other on-line compression models in
which conformal prediction is very inefficient ([28], p. 220).

After developing the general theory of conformal prediction for on-line
compression models (§5.1 and §5.2), we consider two examples: the exchangeability-
within-label model (§5.3.1) and the on-line Gaussian linear model (§5.3.2).
5.1 Definitions

A more formal look at the exchangeability model will suffice to bring the general 
notion of an on-line compression model into focus. 

In the exchangeability model, we summarize examples simply by omitting 
information about their ordering; the ordered examples are summarized by a 
bag containing them. The backward-looking probabilities are equally simple; 
given the bag, the different possible orderings all have equal probability, as if 
the ordering resulted from drawing the examples successively at random from 
the bag without replacement. Although this picture is very simple, we can 
distinguish four distinct mathematical operations within it: 

1. Summarizing. The examples $z_1, \ldots, z_n$ are summarized by the bag
$\lbag z_1, \ldots, z_n \rbag$. We can say that the summarization is accomplished by a
summarizing function $\Sigma_n$ that maps an $n$-tuple of examples $(z_1, \ldots, z_n)$
to the bag containing these examples:

$$\Sigma_n(z_1, \ldots, z_n) := \lbag z_1, \ldots, z_n \rbag.$$

We write $\sigma_n$ for the summary, i.e., the bag $\lbag z_1, \ldots, z_n \rbag$.

2. Updating. The summary can be formed step by step as the examples are
observed. Once you have the bag containing the first $n - 1$ examples, you
just add the $n$th. This defines an updating function $U_n(\sigma, z)$ that satisfies

$$\Sigma_n(z_1, \ldots, z_n) = U_n(\Sigma_{n-1}(z_1, \ldots, z_{n-1}), z_n).$$

The top panel in Figure 8 depicts how the summary $\sigma_n$ is built up step
by step from $z_1, \ldots, z_n$ using the updating functions $U_1, \ldots, U_n$. First
$\sigma_1 = U_1(\Box, z_1)$, where $\Box$ is the empty bag. Then $\sigma_2 = U_2(\sigma_1, z_2)$, and so
on.

3. Looking back all the way. Given the bag $\sigma_n$, the $n!$ different orderings
of the elements of the bag are equally likely, just as they would be if we
ordered the contents of the bag randomly. As we learned in §3.2, we can
say this with a formula that takes explicit account of the possibility of
repetitions in the bag: the probability of the event $\{z_1 = a_1, \ldots, z_n = a_n\}$
is

$$P_n(a_1, \ldots, a_n \mid \sigma_n) = \begin{cases} \dfrac{n_1! \cdots n_k!}{n!} & \text{if } \lbag a_1, \ldots, a_n \rbag = \sigma_n, \\ 0 & \text{otherwise,} \end{cases} \tag{29}$$

where $k$ is the number of distinct elements in $\sigma_n$, and $n_1, \ldots, n_k$ are the
numbers of times these distinct elements occur. We call $P_1, P_2, \ldots$ the full
kernels.

4. Looking back one step. We can also look back one step. Given the bag $\sigma_n$,
what are the probabilities for $z_n$ and $\sigma_{n-1}$? They are the same as if we
drew $z_n$ out of $\sigma_n$ at random. In other words, for each $z$ that appears in
$\sigma_n$, there is a probability $k/n$, where $k$ is the number of times $z$ appears
in $\sigma_n$, that (1) $z_n = z$ and (2) $\sigma_{n-1}$ is the bag obtained by removing one
instance of $z$ from $\sigma_n$. The kernel defined in this way is represented by
the two arrows backward from $\sigma_n$ in the bottom panel of Figure 8. Let us
designate it by $R_n$. We similarly obtain a kernel $R_{n-1}$ backward from $\sigma_{n-1}$,
and so on. These are the one-step kernels for the model. We can obtain
the full kernel $P_n$ by combining the one-step kernels $R_n, R_{n-1}, \ldots, R_1$.
This is most readily understood not in terms of formulas but in terms of
a sequence of drawings whose outcomes have the probability distributions
given by the kernels. The drawing from $\sigma_n$ (which goes by the probabilities
given by $R_n(\cdot \mid \sigma_n)$) gives us $z_n$ and $\sigma_{n-1}$, the drawing from $\sigma_{n-1}$ (which
goes by the probabilities given by $R_{n-1}(\cdot \mid \sigma_{n-1})$) gives us $z_{n-1}$ and $\sigma_{n-2}$,
and so on; we finally obtain the whole random sequence $z_1, \ldots, z_n$, which
has the distribution $P_n(\cdot \mid \sigma_n)$. This is the meaning of the bottom panel in
Figure 8.

All four operations are important. The second and fourth, updating and looking 
back one step, can be thought of as the most fundamental, because we can de- 
rive the other two from them. Summarization can be carried out by composing 
updates, and looking back all the way can be carried out by composing one-step 
look-backs. Moreover, the conformal algorithm uses the one-step back proba- 
bilities. But when we turn to particular on-line compression models, we will 
find it initially most convenient to describe them in terms of their summarizing 
functions and full kernels. 
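The four operations can be made concrete for the exchangeability model with a small Python sketch. It represents a bag as a `collections.Counter` and uses exact fractions; all function names are illustrative choices rather than notation from the text. The last function composes one-step look-backs so that the result can be compared with the full kernel.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def summarize(examples):
    """Summarizing: forget the order and keep the bag."""
    return Counter(examples)

def update(bag, z):
    """Updating: add the new example z to a copy of the old bag."""
    new_bag = Counter(bag)
    new_bag[z] += 1
    return new_bag

def full_kernel(bag, ordering):
    """Looking back all the way: n_1! ... n_k! / n! for orderings of the bag."""
    if Counter(ordering) != bag:
        return Fraction(0)
    numerator = 1
    for count in bag.values():
        numerator *= factorial(count)
    return Fraction(numerator, factorial(sum(bag.values())))

def one_step_kernel(bag):
    """Looking back one step: z -> (probability k/n, bag with one z removed)."""
    n = sum(bag.values())
    result = {}
    for z, k in bag.items():
        previous = Counter(bag)
        previous[z] -= 1
        if previous[z] == 0:
            del previous[z]
        result[z] = (Fraction(k, n), previous)
    return result

def chained_probability(ordering):
    """Probability of an ordering obtained by composing one-step look-backs."""
    bag = summarize(ordering)
    prob = Fraction(1)
    for z in reversed(ordering):       # draw z_n first, then z_{n-1}, ...
        step_prob, bag = one_step_kernel(bag)[z]
        prob *= step_prob
    return prob

ordering = ("a", "b", "a")
bag = summarize(ordering)
print(full_kernel(bag, ordering), chained_probability(ordering))
```

For the bag containing "a" twice and "b" once, each of the three distinct orderings has probability $2! \cdot 1! / 3! = 1/3$, and the chained one-step draws give the same value.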

In general, an on-line compression model for an example space $\mathbf{Z}$ consists of
a space $\mathbf{S}$, whose elements we call summaries, and two sequences of mappings:

• Updating functions $U_1, U_2, \ldots$. The function $U_n$ maps a summary $s$ and
an example $z$ to a new summary $U_n(s, z)$.

• One-step kernels $R_1, R_2, \ldots$. For each summary $s$, the kernel $R_n$ gives a
joint probability distribution $R_n(s', z \mid s)$ for an unknown summary $s'$ and
unknown example $z$. We require that $R_n(\cdot \mid s)$ give probability one to the
set of pairs $(s', z)$ such that $U_n(s', z) = s$.

We also require that the summary space $\mathbf{S}$ include the empty summary $\Box$.

The recipes for constructing the summarizing functions $\Sigma_1, \Sigma_2, \ldots$ and the
full kernels $P_1, P_2, \ldots$ are the same in general as in the exchangeability model:

• The summary $\sigma_n = \Sigma_n(z_1, \ldots, z_n)$ is built up step by step from $z_1, \ldots, z_n$
using the updating functions. First $\sigma_1 = U_1(\Box, z_1)$, then $\sigma_2 = U_2(\sigma_1, z_2)$,
and so on.

• We obtain the full kernel $P_n$ by combining, backwards from $\sigma_n$, the random
experiments represented by the one-step kernels $R_n, R_{n-1}, \ldots, R_1$. First
we draw $z_n$ and $\sigma_{n-1}$ from $R_n(\cdot \mid \sigma_n)$, then we draw $z_{n-1}$ and $\sigma_{n-2}$ from
$R_{n-1}(\cdot \mid \sigma_{n-1})$, and so on. The sequence $z_1, \ldots, z_n$ obtained in this way
has the distribution $P_n(\cdot \mid \sigma_n)$.
[Figure 8, top panel: the examples $z_1, z_2, \ldots, z_{n-1}, z_n$ feed a forward chain of
summaries $\Box \to \sigma_1 \to \sigma_2 \to \cdots \to \sigma_{n-1} \to \sigma_n$.]

Updating. We speak of "on-line" compression models because the summary can
be updated with each new example. In the case of the exchangeability model,
we obtain the bag $\sigma_i$ by adding the new example $z_i$ to the old bag $\sigma_{i-1}$.

[Figure 8, bottom panel: the same chain with the arrows reversed, running from
each $\sigma_i$ back to $z_i$ and $\sigma_{i-1}$.]

Backward probabilities. The two arrows backwards from $\sigma_i$ symbolize our
probabilities, conditional on $\sigma_i$, for what example $z_i$ and what previous summary $\sigma_{i-1}$
were combined to produce $\sigma_i$. Like the diagram in Figure 3 that it generalizes,
this diagram is a Bayes net.

Figure 8: Elements of an on-line compression model. The top diagram
represents the updating functions $U_1, \ldots, U_n$. The bottom diagram represents
the one-step kernels $R_1, \ldots, R_n$.
On-line compression models are usually initially specified in terms of their
summarizing functions $\Sigma_n$ and their full kernels $P_n$, because these are usually
easy to describe. One must then verify that these easily described objects do
define an on-line compression model. This requires verifying two points:

1. $\Sigma_1, \Sigma_2, \ldots$ can be defined successively by means of updating functions:

$$\Sigma_n(z_1, \ldots, z_n) = U_n(\Sigma_{n-1}(z_1, \ldots, z_{n-1}), z_n). \tag{30}$$

In words: $\sigma_n$ depends on $z_1, \ldots, z_{n-1}$ only through the earlier summary
$\sigma_{n-1}$.

2. Each $P_n$ can be obtained as required using one-step kernels. One way
to verify this is to exhibit the one-step kernels $R_1, \ldots, R_n$ and then to
check that drawing $z_n$ and $\sigma_{n-1}$ from $R_n(\cdot \mid \sigma_n)$, then drawing $z_{n-1}$ and
$\sigma_{n-2}$ from $R_{n-1}(\cdot \mid \sigma_{n-1})$, and so on produces a sequence $z_1, \ldots, z_n$ with
the distribution $P_n(\cdot \mid \sigma_n)$. Another way to verify it, without necessarily
exhibiting the one-step kernels, is to verify the conditional independence
relations represented by Figure 8: $z_n$ (and hence also $\sigma_n$) is probabilistically
independent of $z_1, \ldots, z_{n-1}$ given $\sigma_{n-1}$.
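The first point can be illustrated with the summary used in the Joe and Bill example, the mean $\bar{z}$ together with $r^2$. The Python sketch below checks that this summary can be updated one example at a time; it is only an illustration under two assumptions of ours: that the count $n$ is carried along as part of the summary, and that the update follows the standard Welford recurrence. The function names are hypothetical.

```python
import math

def summarize(zs):
    """Batch summary (n, mean, r2) with r2 = sum of squared deviations."""
    n = len(zs)
    mean = sum(zs) / n
    r2 = sum((z - mean) ** 2 for z in zs)
    return n, mean, r2

def update(summary, z):
    """On-line update: the new summary depends only on the old one and z."""
    n, mean, r2 = summary
    n_new = n + 1
    mean_new = mean + (z - mean) / n_new
    r2_new = r2 + (z - mean) * (z - mean_new)   # Welford's recurrence
    return n_new, mean_new, r2_new

EMPTY = (0, 0.0, 0.0)   # plays the role of the empty summary

zs = [2.0, 4.0, 4.0, 5.0]
online = EMPTY
for z in zs:
    online = update(online, z)

print(online, summarize(zs))
```

Both routes give $n = 4$, $\bar{z} = 3.75$ and $r^2 = 4.75$, which is the property expressed by (30) for this particular summary.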

5.2 Conformal prediction 

In the context of an on-line compression model, a nonconformity measure is
an arbitrary real-valued function $A(\sigma, z)$, where $\sigma$ is a summary and $z$ is an
example. We choose $A$ so that $A(\sigma, z)$ is large when $z$ seems very different from
the examples that might be summarized by $\sigma$.

In order to state the conformal algorithm, we write $\tilde{\sigma}_{n-1}$ and $\tilde{z}_n$ for random
variables with a joint probability distribution given by the one-step kernel
$R_n(\cdot \mid \sigma_n)$. The algorithm using old examples alone can then be stated as follows:

The Conformal Algorithm Using Old Examples Alone

Input: Nonconformity measure $A$, significance level $\epsilon$,
examples $z_1, \ldots, z_{n-1}$, example $z$

Task: Decide whether to include $z$ in $\gamma^\epsilon(z_1, \ldots, z_{n-1})$.

Algorithm:

1. Provisionally set $z_n := z$.

2. Set $p_z := R_n\bigl( A(\tilde{\sigma}_{n-1}, \tilde{z}_n) \ge A(\sigma_{n-1}, z_n) \mid \sigma_n \bigr)$.

3. Include $z$ in $\gamma^\epsilon(z_1, \ldots, z_{n-1})$ if and only if $p_z > \epsilon$.
To see that this reduces to the algorithm we gave for the exchangeability
model on p. 17, recall that $\sigma_n = \lbag z_1, \dots, z_n \rbag$ and
$\tilde\sigma_{n-1} = \lbag z_1, \dots, z_n \rbag \setminus \lbag \tilde z_n \rbag$ in
that model, so that

$$A(\tilde\sigma_{n-1}, \tilde z_n) = A(\lbag z_1, \dots, z_n \rbag \setminus \lbag \tilde z_n \rbag, \tilde z_n) \tag{31}$$

and

$$A(\sigma_{n-1}, z_n) = A(\lbag z_1, \dots, z_{n-1} \rbag, z_n). \tag{32}$$

Under $R_n(\cdot \mid \lbag z_1, \dots, z_n \rbag)$, the random variable $\tilde z_n$ has equal chances of being
any of the $z_i$, so that the probability of (31) being greater than or equal to (32)
is simply the fraction of the $z_i$ for which

$$A(\lbag z_1, \dots, z_n \rbag \setminus \lbag z_i \rbag, z_i) \ge A(\lbag z_1, \dots, z_{n-1} \rbag, z_n),$$

and this is how $p_z$ is defined on p. 17.
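The fraction just described is easy to compute directly. The Python sketch below is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the nonconformity measure (distance from an example to the mean of the bag) is chosen only to keep the example short.

```python
from fractions import Fraction

def nonconformity(bag, z):
    """Hypothetical nonconformity measure A(bag, z): distance from z to the bag's mean."""
    return abs(z - sum(bag) / len(bag))

def p_value(old_examples, z, A=nonconformity):
    """Fraction of i for which A(bag minus z_i, z_i) >= A(old bag, z_n), with z_n := z."""
    examples = list(old_examples) + [z]          # provisionally set z_n := z
    alpha_n = A(list(old_examples), z)
    count = 0
    for i, z_i in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]   # the bag with z_i removed
        if A(rest, z_i) >= alpha_n:
            count += 1
    return Fraction(count, len(examples))

def in_region(old_examples, z, epsilon, A=nonconformity):
    """Include z in the prediction region if and only if its p-value exceeds epsilon."""
    return p_value(old_examples, z, A) > epsilon

print(p_value([1, 2, 3, 4], 10))   # 1/5
print(p_value([1, 2, 3, 4], 3))    # 1
```

With four old examples the smallest possible p-value is 1/5, so at a significance level below 1/5 no candidate can be excluded; more old examples are needed for informative regions.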

Our arguments for the validity of the regions $\gamma^\epsilon(z_1, \dots, z_{n-1})$ in the
exchangeability model generalize readily. The definitions of $n$-event and $\epsilon$-rare
generalize in an obvious way:

• An event $E$ is an $n$-event if its happening or failing is determined by the
value of $z_n$ and the value of the summary $\sigma_{n-1}$.

• An $n$-event $E$ is $\epsilon$-rare if $R_n(E \mid \sigma_n) \le \epsilon$.

The event $z_n \notin \gamma^\epsilon(z_1, \dots, z_{n-1})$ is an $n$-event, and it is $\epsilon$-rare (the probability
is $\epsilon$ or less that a random variable will take a value that it equals or exceeds
with a probability of $\epsilon$ or less). So working backwards from the summary $\sigma_N$
for a large value of $N$, Bill can still bet against the errors successively at rates
corresponding to their probabilities under $\sigma_n$, which are always $\epsilon$ or less. This
produces an exact analog to Informal Proposition 1.

Informal Proposition 2 Suppose $N$ is large, and the variables $z_1, \dots, z_N$
obey an on-line compression model. Suppose $E_n$ is an $\epsilon$-rare $n$-event for
$n = 1, \dots, N$. Then the law of large numbers applies; with very high probability,
no more than approximately the fraction $\epsilon$ of the events $E_1, \dots, E_N$ will
happen.

The conformal algorithm using features of the new example generalizes similarly:



The Conformal Algorithm

Input: Nonconformity measure $A$, significance level $\epsilon$, examples $z_1, \dots, z_{n-1}$, object $x_n$, label $y$

Task: Decide whether to include $y$ in $\Gamma^\epsilon(z_1, \dots, z_{n-1}, x_n)$.

Algorithm:

1. Provisionally set $z_n := (x_n, y)$.

2. Set $p_y := R_n(A(\tilde\sigma_{n-1}, \tilde z_n) \ge A(\sigma_{n-1}, z_n) \mid \sigma_n)$.

3. Include $y$ in $\Gamma^\epsilon(z_1, \dots, z_{n-1}, x_n)$ if and only if $p_y > \epsilon$.





The validity of this algorithm follows from the validity of the algorithm using
old examples alone by the same argument as in the case of exchangeability.

5.3 Examples

We now look at two on-line compression models: the exchangeability-within- 
label model and the on-line Gaussian linear model. 

The exchangeability-within-label model was first introduced in work leading
up to our monograph [28]. It weakens the assumption of exchangeability.

The on-line Gaussian linear model, as we have already mentioned, has been 
widely studied. It overlaps the exchangeability model, in the sense that the 
assumptions for both of the models can hold at the same time, but the assump- 
tions for one of them can hold without the assumptions for the other holding. 
It is closely related to the classical Gaussian linear model. Conformal predic- 
tion in the on-line model leads to the same prediction regions that are usually 
used for the classical model. But the conformal prediction theory adds the new 
information that these intervals are valid in the sense of this article: they are
right $1 - \epsilon$ of the time when used on accumulating data.

5.3.1 The exchangeability-within-label model 

The assumption of exchangeability can be weakened in many ways. In the case 
of classification, one interesting possibility is to assume only that the examples 
for each label are exchangeable with each other. For each label, the objects with 
that label are as likely to appear in one order as in another. This assumption 
leaves open the possibility that the appearance of one label might change the 
probabilities for the next label. 

Suppose the label space has $k$ elements, say $\mathbf{Y} = \{1, \dots, k\}$. Then we can
define the exchangeability-within-label model as follows:

Summarizing Functions The $n$th summarizing function is

$$\Sigma_n(z_1, \dots, z_n) := (y_1, \dots, y_n, B_1^n, \dots, B_k^n), \tag{33}$$

where $B_j^n$ is the bag consisting of the objects in the list $x_1, \dots, x_n$ that
have the label $j$.

Full Kernels The full kernel $P_n(z_1, \dots, z_n \mid y_1, \dots, y_n, B_1^n, \dots, B_k^n)$ is most
easily described in terms of the random action for which it gives the probabilities:
independently for each label $j$, distribute the objects in the bag $B_j^n$
randomly among the positions $i$ for which $y_i$ is equal to $j$.

To check that this is an on-line compression model, we exhibit the updating
function and the one-step kernels:

Updating When $(x_n, y_n)$ is observed, the summary

$$(y_1, \dots, y_{n-1}, B_1^{n-1}, \dots, B_k^{n-1})$$

is updated by inserting $y_n$ after $y_{n-1}$ and adding $x_n$ to $B_{y_n}^{n-1}$.

One step back The one-step kernel $R_n$ is given by

$$R_n(\text{summary}, (x, y) \mid y_1, \dots, y_n, B_1^n, \dots, B_k^n) = \begin{cases} \dfrac{k}{|B_{y_n}^n|} & \text{if } y = y_n \\ 0 & \text{otherwise,} \end{cases}$$

where $k$ is the number of $x$s in $B_{y_n}^n$. This is the same as the probability
the one-step kernel for the exchangeability model without objects would
give for $x$ on the basis of a bag of size $|B_{y_n}^n|$ that includes $k$ $x$s.

Because the true labels are part of the summary, our imaginary bettor Bill 
can choose to bet just on those rounds of his game with Joe where the label has 
a particular value, and this implies that a 95% conformal predictor under the 
exchangeability-within-label model will make errors at no more than a 5% rate 
for examples with that label. This is not necessarily true for a 95% conformal 
predictor under the exchangeability model; although it can make errors no more 
than about 5% of the time overall, its error rate may be higher for some labels 
and lower for others. As Figure 9 shows, this happens in the case of the USPS
dataset. The graph in the top panel of the figure shows the cumulative errors 
for examples with the label 5, which is particularly easy to confuse with other 
digits, when the nearest-neighbor conformal predictor is applied to that data 
in permuted form. The error rate for 5 is over 11%. The graph in the bottom 
panel shows the results of the exchangeability-within-label conformal predictor 
using the same nearest-neighbor nonconformity measure; here the error rate 
stays close to 5%. As this graph makes clear, the predictor holds the error rate 
down to 5% in this case by producing many prediction regions containing more 
than one label ("uncertain predictions").

As we explain in §4.5 and §8.4 of [28], the exchangeability-within-label model
is a Mondrian model. In general, a Mondrian model decomposes the space $\mathbf{Z} \times \mathbb{N}$,
where $\mathbb{N}$ is the set of natural numbers, into non-overlapping rectangles, and it
asks for exchangeability only within these rectangles. For each example $z_i$,
it then records, as part of the summary, the rectangle into which $(z_i, i)$ falls.
Mondrian models can be useful when we need to weaken the assumption of 
exchangeability. They can also be attractive even if we are willing to assume 
exchangeability across the categories, because the conformal predictions they 
produce will be calibrated within categories. 
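To make the label-conditional computation concrete, here is a minimal Python sketch of the p-value under the exchangeability-within-label model. It is an illustration under stated assumptions: the function names are hypothetical, objects are single numbers, and the nonconformity measure (distance to the nearest old object with the same label) is a simplified stand-in for the nearest-neighbor measure used in the USPS experiment. The only change from the exchangeability model is that the new example is compared with the examples sharing its provisional label, as the one-step kernel above prescribes.

```python
from fractions import Fraction

def nearest_same_label_distance(bags, x, y):
    """Hypothetical nonconformity measure: distance from x to the nearest object labeled y."""
    return min((abs(x - other) for other in bags.get(y, [])), default=float("inf"))

def label_conditional_p_value(old_examples, x_new, y, A=nearest_same_label_distance):
    """p-value for the provisional label y, using only examples with label y."""
    bags = {}
    for x, label in old_examples:
        bags.setdefault(label, []).append(x)
    alpha_new = A(bags, x_new, y)
    same_label = bags.get(y, []) + [x_new]       # bag for label y after adding x_new
    count = 0
    for i, x in enumerate(same_label):
        reduced = dict(bags)
        reduced[y] = same_label[:i] + same_label[i + 1:]   # remove x from its bag
        if A(reduced, x, y) >= alpha_new:
            count += 1
    return Fraction(count, len(same_label))

old = [(1, "a"), (1.5, "a"), (2, "a"), (9, "b"), (10, "b")]
for label in ("a", "b"):
    print(label, label_conditional_p_value(old, 5, label))
```

Because each label has its own denominator, a rare label gets coarse p-values; this is the price of calibration within categories.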



5.3.2 The on-line Gaussian linear model 



Consider examples $z_1, \dots, z_N$, of the form $z_n = (x_n, y_n)$, where $y_n$ is a number
and $x_n$ is a row vector consisting of $p$ numbers. For each $n$ between 1 and $N$,
set

$$X_n := \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \quad \text{and} \quad Y_n := \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.$$

Thus $X_n$ is an $n \times p$ matrix, and $Y_n$ is a column vector of length $n$.

Figure 9: Errors for 95% conformal prediction using nearest neighbors
in the permuted USPS data when the true label is 5. In both figures, the
dotted line represents the overall expected error rate of 5%. The actual error
rate for 5s with the exchangeability-within-label model tracks this line, but with
the exchangeability model it is much higher. The exchangeability-within-label
predictor keeps its error rate down by issuing more prediction regions containing
more than one digit ("uncertain predictions").

In this context, the on-line Gaussian linear model is the on-line compression
model defined by the following summarizing functions and full kernels:

Summarizing Functions The $n$th summarizing function is

$$\Sigma_n(z_1, \dots, z_n) := \left( x_1, \dots, x_n, \sum_{i=1}^{n} y_i x_i, \sum_{i=1}^{n} y_i^2 \right) = (X_n, X_n' Y_n, Y_n' Y_n). \tag{34}$$

Full Kernels The full kernel $P_n(z_1, \dots, z_n \mid \sigma_n)$ distributes its probability
uniformly over the space of vectors $(y_1, \dots, y_n)$ consistent with the summary
$\sigma_n$. (We consider probabilities only for $y_1, \dots, y_n$, because $x_1, \dots, x_n$ are
fixed by $\sigma_n$.)

We can write $\sigma_n = (X_n, C, r^2)$, where $C$ is a column vector of length $p$, and $r$
is a nonnegative number. A vector $(y_1, \dots, y_n)$ is consistent with $\sigma_n$ if

$$\sum_{j=1}^{n} y_j x_j = C \quad \text{and} \quad \sum_{j=1}^{n} y_j^2 = r^2.$$

This is the intersection of a hyperplane with the surface of a sphere. Not being
empty, the intersection is either a point (in the exceptional case where the
hyperplane is tangent to the sphere) or the surface of a lower-dimensional sphere.
(Imagine intersecting a plane and the surface of a 3-dimensional sphere; the
result is a circle, the surface of a 2-dimensional sphere.) The kernel $P_n(\cdot \mid \sigma_n)$
puts all its probability on the point or distributes it uniformly over the surface
of the lower-dimensional sphere.
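The summary (34) can be built one example at a time, by appending the new object and adding one term to each sum. The Python sketch below illustrates this and the notion of a label vector being consistent with a summary; the function names and the numerical tolerance are assumptions made for the illustration.

```python
def empty_summary(p):
    """Summary of zero examples: no objects, zero sums."""
    return ([], [0.0] * p, 0.0)

def update(summary, x, y):
    """Fold one example (x, y) into (objects, sum of y_i x_i, sum of y_i^2)."""
    rows, c, r2 = summary
    return (rows + [list(x)], [cj + y * xj for cj, xj in zip(c, x)], r2 + y * y)

def consistent(summary, ys, tol=1e-9):
    """Check whether the labels ys give the same C and r^2 as the summary."""
    rows, c, r2 = summary
    c_ys = [sum(y * row[j] for y, row in zip(ys, rows)) for j in range(len(c))]
    r2_ys = sum(y * y for y in ys)
    return all(abs(a - b) <= tol for a, b in zip(c_ys, c)) and abs(r2_ys - r2) <= tol

summary = empty_summary(2)
for x, y in [((1, 0), 1), ((1, 1), 2), ((1, 2), 4)]:
    summary = update(summary, x, y)

print(summary[1], summary[2])                     # [7.0, 10.0] 21.0
print(consistent(summary, [1, 2, 4]))             # True
print(consistent(summary, [2 / 3, 8 / 3, 11 / 3]))  # True
print(consistent(summary, [1, 2, 5]))             # False
```

In this small case ($n = 3$, $p = 2$) exactly two label vectors are consistent with the summary, so the full kernel gives each of them probability 1/2.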

To see that the summarizing functions and full kernels define an on-line
compression model, we must check that the summaries can be updated and that the
full kernels have the required conditional independence property: conditioning
$P_n(\cdot \mid \sigma_n)$ on $z_{i+1}, \dots, z_n$ gives $P_i(\cdot \mid \sigma_i)$. (We do not condition on $\sigma_i$ since it can
be computed from $z_{i+1}, \dots, z_n$ and $\sigma_n$.) Updating is straightforward; when we
observe $(x_n, y_n)$, we update the summary

$$\left( x_1, \dots, x_{n-1}, \sum_{i=1}^{n-1} y_i x_i, \sum_{i=1}^{n-1} y_i^2 \right)$$

by inserting $x_n$ after $x_{n-1}$ and adding a term to each of the sums. To
see that conditioning $P_n(\cdot \mid \sigma_n)$ on $z_{i+1}, \dots, z_n$ gives $P_i(\cdot \mid \sigma_i)$, we note that
conditioning the uniform distribution on the surface of a sphere on values
$y_{i+1} = a_{i+1}, \dots, y_n = a_n$ involves intersecting the surface with the hyperplanes
defined by these $n - i$ equations. This produces the uniform distribution on the
surface of the possibly lower-dimensional sphere defined by

$$\sum_{j=1}^{i} y_j^2 = r^2 - \sum_{j=i+1}^{n} y_j^2 \quad \text{and} \quad \sum_{j=1}^{i} y_j x_j = C - \sum_{j=i+1}^{n} y_j x_j;$$

this is indeed $P_i(y_1, \dots, y_i \mid \sigma_i)$.

The on-line Gaussian linear model is closely related to the classical Gaussian
linear model. In the classical model,*

$$y_i = x_i \beta + e_i, \tag{35}$$

where the $x_i$ are row vectors of known numbers, $\beta$ is a column vector of unknown
numbers (the regression coefficients), and the $e_i$ are independent of each other
and normally distributed with mean zero and a common variance. When $n - 1 > p$
and $\mathrm{Rank}(X_{n-1}) = p$, the theory of the classical model tells us the following:

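• After observing examples $(x_1, y_1, \dots, x_{n-1}, y_{n-1})$, estimate the vector of
coefficients $\beta$ by

$$\hat\beta_{n-1} := (X_{n-1}' X_{n-1})^{-1} X_{n-1}' Y_{n-1}$$

and after further observing $x_n$, predict $y_n$ by

$$\hat y_n := x_n \hat\beta_{n-1} = x_n (X_{n-1}' X_{n-1})^{-1} X_{n-1}' Y_{n-1}.$$

• Estimate the variance of the $e_i$ by

$$s_{n-1}^2 := \frac{Y_{n-1}' Y_{n-1} - \hat\beta_{n-1}' X_{n-1}' Y_{n-1}}{n - p - 1}.$$

• The random variable

$$t_n := \frac{y_n - \hat y_n}{s_{n-1} \sqrt{1 + x_n (X_{n-1}' X_{n-1})^{-1} x_n'}} \tag{36}$$

has a $t$-distribution with $n - p - 1$ degrees of freedom, and so

$$\hat y_n \pm t_{n-p-1}^{\epsilon/2} \, s_{n-1} \sqrt{1 + x_n (X_{n-1}' X_{n-1})^{-1} x_n'} \tag{37}$$

has probability $1 - \epsilon$ of containing $y_n$ ([21], p. 127; [22], p. 132).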
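The quantities in this list can be computed with a few lines of arithmetic. The Python sketch below is an illustration only, written with the standard library: `solve` and `classical_interval` are hypothetical helper names, the normal equations are solved by plain Gauss-Jordan elimination, and the $t$ quantile is supplied by the caller (for example from a table) rather than computed.

```python
import math

def solve(a, b):
    """Solve the square system a v = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def classical_interval(rows, ys, x_new, t_quantile):
    """Return (y_hat, half_width) as in (37).

    rows, ys: the n - 1 old objects and labels; requires len(rows) > p and full column rank.
    t_quantile: upper epsilon/2 point of the t-distribution with len(rows) - p
    degrees of freedom, supplied by the caller.
    """
    p = len(x_new)
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    beta = solve(xtx, xty)                                  # least-squares coefficients
    y_hat = sum(xj * bj for xj, bj in zip(x_new, beta))
    rss = sum(y * y for y in ys) - sum(bj * cj for bj, cj in zip(beta, xty))
    s = math.sqrt(max(rss, 0.0) / (len(rows) - p))          # guard against rounding below zero
    leverage = sum(xj * vj for xj, vj in zip(x_new, solve(xtx, list(x_new))))
    return y_hat, t_quantile * s * math.sqrt(1 + leverage)

# Special case p = 1 with every x_i equal to 1; 2.776 is the tabulated
# upper 2.5% point of the t-distribution with 4 degrees of freedom.
y_hat, half = classical_interval([[1.0]] * 5, [1, 2, 3, 4, 5], [1.0], 2.776)
print(y_hat, half)
```

In this special case the prediction is the sample mean and the half-width is $t \cdot s \sqrt{6/5}$, which is the form taken by Fisher's interval for a single normal variable.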

The assumption $\mathrm{Rank}(X_{n-1}) = p$ can be relaxed, at the price of complicating
the formulas involving $(X_{n-1}' X_{n-1})^{-1}$. But the assumption $n - 1 > \mathrm{Rank}(X_{n-1})$
is essential to finding a prediction interval of the type (37): when it fails there
are values for the coefficients $\beta$ such that $Y_{n-1} = X_{n-1} \beta$, and consequently
there is no residual variance with which to estimate the variance of the $e_i$.

We have already used two special cases of (37) in this article. Formula (1)
in §2.1.1 is the special case with $p = 1$ and each $x_i$ equal to 1, and formula (25)
at the beginning of §4.3.2 is the special case with $p = 2$ and the first entry of
each $x_i$ equal to 1.

The relation between the classical and on-line models, fully understood in 
the theoretical literature since the 1980s, can be summarized as follows: 



• If $z_1, \dots, z_N$ satisfy the assumptions of the classical Gaussian linear model,
then they satisfy the assumptions of the on-line Gaussian linear model. In
other words, the assumption that the errors $e_i$ in (35) are independent and
normal with mean zero and a common variance implies that conditional
on $X_n' Y_n = C$ and $Y_n' Y_n = r^2$, the vector $Y_n$ is distributed uniformly over
the surface of the sphere defined by $C$ and $r^2$. This was already noted by
R. A. Fisher in 1925 [?].

• The assumption of the on-line Gaussian linear model, that conditional on
$X_n' Y_n = C$ and $Y_n' Y_n = r^2$, the vector $Y_n$ is distributed uniformly over
the surface of the sphere defined by $C$ and $r^2$, is sufficient to guarantee
that $t_n$ has the $t$-distribution with $n - p - 1$ degrees of freedom [?].

• Suppose $z_1, z_2, \dots$ is an infinite sequence of random variables. Then
$z_1, \dots, z_N$ satisfy the assumptions of the on-line Gaussian linear model
for every integer $N$ if and only if the joint distribution of $z_1, z_2, \dots$ is a
mixture of distributions given by the classical Gaussian linear model, each
model in the mixture possibly having a different $\beta$ and a different variance
for the $e_i$ [?].

* There are many names for the classical model. The name "classical Gaussian linear
model" is used by Bickel and Doksum [3], p. 366.

A natural nonconformity measure $A$ for the on-line Gaussian linear model
is given, for $\sigma = (X, X'Y, Y'Y)$ and $z = (x, y)$, by

$$A(\sigma, z) := |y - \hat y|, \tag{38}$$

where $\hat y = x (X'X)^{-1} X'Y$.

Proposition 1 When (38) is used as the nonconformity measure, the $1 - \epsilon$
conformal prediction region for $y_n$ is (37), the interval given by the $t$-distribution
in the classical theory.

Proof When (38) is used as the nonconformity measure, the test statistic
$A(\sigma_{n-1}, z_n)$ used in the conformal algorithm becomes $|y_n - \hat y_n|$. The conformal
algorithm considers the distribution of this statistic under $R_n(\cdot \mid \sigma_n)$. But
when $\sigma_n$ is fixed and $t_n$ is given by (36), $|t_n|$ is a monotonically increasing
function of $|y_n - \hat y_n|$ (see pp. 202-203 of [28] for details). So the conformal prediction
region is the interval of values of $y_n$ for which $|t_n|$ does not take its most extreme
values. Since $t_n$ has the $t$-distribution with $n - p - 1$ degrees of freedom under
$R_n(\cdot \mid \sigma_n)$, this is the interval (37). ∎

Together with Informal Proposition 2, Proposition 1 implies that when we
use (37) for a large number of successive values of $n$, $y_n$ will be in the interval
$1 - \epsilon$ of the time. In fact, because the probability of error each time is exactly
$\epsilon$, we can say simply that the errors are independent and for this reason the
classical law of large numbers applies.

In our example involving the prediction of petal width from sepal length, the
exchangeability and Gaussian linear models gave roughly comparable results
(see Table 5 in §4.3.2). This will often be the case. Each model makes an
assumption, however, that the other does not make. The exchangeability model
assumes that the $x$s, as well as the $y$s, are exchangeable. The Gaussian linear
model assumes that given the $x$s, the $y$s are normally distributed.

A Validity

The main purpose of this appendix is to formalize and prove the following
informal proposition:

Informal Proposition 1 Suppose $N$ is large, and the variables $z_1, \dots, z_N$ are
exchangeable. Suppose $E_n$ is an $\epsilon$-rare $n$-event for $n = 1, \dots, N$. Then the law
of large numbers applies; with very high probability, no more than approximately
the fraction $\epsilon$ of the events $E_1, \dots, E_N$ will happen.

We used this informal proposition in §3.4 to establish the validity of conformal
prediction in the exchangeability model. As we promised then, we will discuss
two different approaches to formalizing it: a classical approach and a
game-theoretical approach. The classical approach shows that the $E_n$ are mutually
independent in the case where they are exactly $\epsilon$-rare and then appeals to the
classical weak law of large numbers for independent events. The game-theoretic
approach appeals directly to the more flexible game-theoretic weak law of large
numbers.

Our proofs will also establish the analogous Informal Proposition 2, which
we used to establish the validity of conformal prediction in on-line compression
models in general.

In §A.3, we return to R. A. Fisher's prediction interval for a normal random
variable, which we discussed in §2.1.1. We show that this prediction interval's
successive hits are independent, so that validity follows from the usual law of
large numbers. Fisher's prediction interval is a special case of conformal
prediction for the Gaussian linear model, and so it is covered by the general result for
on-line compression models. But the proof in §2.1.1, being self-contained and
elementary and making no reference to conformal prediction, may be especially
informative for many readers.

A.1 A classical argument for independence

Recall the definitions we gave in §3.4 in the case where $z_1, \dots, z_N$ are exchangeable:
An event $E$ is an $n$-event if its happening or failing is determined by the
value of $z_n$ and the value of the bag $\lbag z_1, \dots, z_{n-1} \rbag$, and an $n$-event $E$ is $\epsilon$-rare
if $\Pr(E \mid \lbag z_1, \dots, z_n \rbag) \le \epsilon$. Let us further say that an $n$-event $E$ is exactly $\epsilon$-rare if

$$\Pr(E \mid \lbag z_1, \dots, z_n \rbag) = \epsilon. \tag{39}$$

The conditional probability in this equation is a random variable, depending on
the random bag $\lbag z_1, \dots, z_n \rbag$, but the equation says that it is not really random,
for it is always equal to $\epsilon$. Its expected value, the unconditional probability of
$E$, is therefore also equal to $\epsilon$.

Proposition 2 Suppose $E_n$ is an exactly $\epsilon$-rare $n$-event for $n = 1, \dots, N$. Then
$E_1, \dots, E_N$ are mutually independent.

Proof Consider (39) for $n = N - 1$:

$$\Pr(E_{N-1} \mid \lbag z_1, \dots, z_{N-1} \rbag) = \epsilon. \tag{40}$$

Given $\lbag z_1, \dots, z_{N-1} \rbag$, knowledge of $z_N$ does not change the probabilities for
$z_{N-1}$ and $\lbag z_1, \dots, z_{N-2} \rbag$, and $z_{N-1}$ and $\lbag z_1, \dots, z_{N-2} \rbag$ determine the
$(N-1)$-event $E_{N-1}$. So adding knowledge of $z_N$ will not change the probability in (40):

$$\Pr(E_{N-1} \mid \lbag z_1, \dots, z_{N-1} \rbag \;\&\; z_N) = \epsilon.$$

Because $E_N$ is determined by $z_N$ once $\lbag z_1, \dots, z_{N-1} \rbag$ is given, it follows that

$$\Pr(E_{N-1} \mid \lbag z_1, \dots, z_{N-1} \rbag \;\&\; E_N) = \epsilon,$$

and from this it follows that $\Pr(E_{N-1} \mid E_N) = \epsilon$. The unconditional probability
of $E_{N-1}$ is also $\epsilon$. So $E_N$ and $E_{N-1}$ are independent. Continuing the reasoning
backwards to $E_1$, we find that the $E_n$ are all mutually independent. ∎

This proof generalizes immediately to the general case of on-line compression
models (see p. 42); we simply replace $\lbag z_1, \dots, z_n \rbag$ with $\sigma_n$.

If $N$ is sufficiently large, and $E_n$ is an exactly $\epsilon$-rare $n$-event for $n = 1, \dots, N$,
then the law of large numbers applies; with very high probability, no more than
approximately the fraction $\epsilon$ of the $N$ events will happen. It is intuitively clear
that this conclusion will also hold if we have an inequality instead of an equality
in (39), because making the $E_n$ even less likely to happen cannot reverse the
conclusion that few of them will happen.

The preceding argument is less than rigorous on two counts. First, the proof
of Proposition 2 does not consider the existence of the conditional probabilities it
uses. Second, the argument from the case where (39) is an equality to that where
it is merely an inequality, though entirely convincing, is only intuitive. A fully
rigorous proof, which uses Doob's measure-theoretic framework to deal with the
conditional probabilities and uses a randomization to bring the inequality up to
an equality, is provided on pp. 211-213 of [28].

A.2 A game-theoretic law of large numbers

As we explained in §3.3, the game-theoretic interpretation of exchangeability
involves a backward-looking protocol, in which Bill observes first the bag
$\lbag z_1, \dots, z_N \rbag$ and then successively $z_N$, $z_{N-1}$, and so on, finally observing $z_1$.
Just before he observes $z_n$, he knows the bag $\lbag z_1, \dots, z_n \rbag$ and can bet on the
value of $z_n$ at odds corresponding to the probabilities the bag determines:

$$\Pr(z_n = a \mid \lbag z_1, \dots, z_n \rbag = B) = \frac{k}{n}, \tag{41}$$

where $k$ is the number of times $a$ occurs in $B$.

The Backward-Looking Betting Protocol
Players: Joe, Bill
$\mathcal{K}_N := 1$.
Joe announces a bag $B_N$ of size $N$.
FOR $n = N, N-1, \dots, 2, 1$:
    Bill bets on $z_n$ at odds set by (41).
    Joe announces $z_n \in B_n$.
    $\mathcal{K}_{n-1} := \mathcal{K}_n$ + Bill's net gain.
    $B_{n-1} := B_n \setminus \lbag z_n \rbag$.

Bill's initial capital $\mathcal{K}_N$ is 1. His final capital is $\mathcal{K}_0$.
Given an event $E$, set

$$e := \begin{cases} 1 & \text{if } E \text{ happens} \\ 0 & \text{if } E \text{ fails.} \end{cases}$$

Given events $E_1, \dots, E_N$, set

$$\mathrm{Freq}_N := \frac{1}{N} \sum_{j=1}^{N} e_j.$$

This is the fraction of the events that happen, that is, the frequency with which they
happen. Our game-theoretic law of large numbers will say that if each $E_n$ is an
$\epsilon$-rare $n$-event, then it is very unlikely that $\mathrm{Freq}_N$ will substantially exceed $\epsilon$.

In game-theoretic probability, what do we mean when we say an event $E$ is
"very unlikely"? We mean that the bettor, Bill in this protocol, has a betting
strategy that guarantees

$$\mathcal{K}_0 \ge \begin{cases} C & \text{if } E \text{ happens} \\ 0 & \text{if } E \text{ fails,} \end{cases} \tag{42}$$

where $C$ is a large positive number. Cournot's principle, which says that Bill will
not multiply his initial unit capital by a large factor without risking bankruptcy,
justifies our thinking $E$ unlikely. The larger $C$, the more unlikely $E$. We call
the quantity

$$\overline{\mathbb{P}} E := \inf \left\{ \frac{1}{C} \;\middle|\; \text{Bill can guarantee (42)} \right\} \tag{43}$$

$E$'s upper probability. An unlikely event is one with small upper probability.

Proposition 3 (Game-theoretic weak law of large numbers) Suppose
$E_n$ is an $\epsilon$-rare $n$-event, for $n = 1, \dots, N$. Suppose $\epsilon < 1/2$, $\delta_1 > 0$, $\delta_2 > 0$,
and $N \ge 1/(\delta_1 \delta_2^2)$. Then

$$\overline{\mathbb{P}}(\mathrm{Freq}_N \ge \epsilon + \delta_2) \le \delta_1.$$

In words: If $N$ is sufficiently large, there is a small (less than $\delta_1$) upper
probability that the frequency will exceed $\epsilon$ substantially (by more than $\delta_2$).

Readers familiar with game-theoretic probability will recognize Proposition 3
as a form of the game-theoretic weak law of large numbers stated and proven on
pp. 124-126 of [24]. The bound it gives for the upper probability of the event
$\mathrm{Freq}_N \ge \epsilon + \delta_2$ is the same as the bound that Chebyshev's inequality gives for
the probability of this event in classical probability theory when the $E_n$ are
independent and all have probability $\epsilon$.
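To get a feeling for the sizes involved in Proposition 3, the following Python sketch computes the smallest number of rounds $N$ satisfying $N \ge 1/(\delta_1 \delta_2^2)$. The function name `rounds_needed` is hypothetical; exact fractions are used so that the ceiling is not disturbed by floating-point rounding.

```python
import math
from fractions import Fraction

def rounds_needed(delta1, delta2):
    """Smallest integer N with N >= 1 / (delta1 * delta2^2), computed exactly."""
    d1, d2 = Fraction(delta1), Fraction(delta2)
    return math.ceil(1 / (d1 * d2 ** 2))

# Upper probability at most 0.01 that the frequency exceeds epsilon by 0.05 or more.
print(rounds_needed("0.01", "0.05"))   # 40000
```

The requirement grows with the inverse square of $\delta_2$, as one expects from a Chebyshev-type bound.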

For the benefit of those not familiar with the concepts used on pp. 124-126
of [24] (after being introduced earlier in the book), we conclude with an
elementary and self-contained proof of Proposition 3.

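Lemma 2 Suppose, for $n = 1, \dots, N$, that $E_n$ is an $\epsilon$-rare $n$-event. Then Bill
has a strategy that guarantees that his capital $\mathcal{K}_n$ will satisfy

$$\mathcal{K}_n \ge \frac{n}{N} + \frac{1}{N} \left( \left( \sum_{j=n+1}^{N} (e_j - \epsilon) \right)^{+} \right)^{2} \tag{44}$$

for $n = 0, \dots, N$, where $t^+ := \max(t, 0)$.

Proof When $n = N$, (44) reduces to $\mathcal{K}_N \ge 1$, and this certainly holds; Bill's
initial capital $\mathcal{K}_N$ is equal to 1. So it suffices to show that if (44) holds for $n$,
then Bill can bet on $E_n$ in such a way that the corresponding inequality for
$n - 1$,

$$\mathcal{K}_{n-1} \ge \frac{n-1}{N} + \frac{1}{N} \left( \left( \sum_{j=n}^{N} (e_j - \epsilon) \right)^{+} \right)^{2}, \tag{45}$$

also holds. Here is how Bill bets.

• If $\sum_{j=n+1}^{N} (e_j - \epsilon) \ge \epsilon$, then Bill buys $(2/N) \sum_{j=n+1}^{N} (e_j - \epsilon)$ units of $e_n$.
By assumption, he pays no more than $\epsilon$ for each unit. So we have a lower
bound on his net gain, $\mathcal{K}_{n-1} - \mathcal{K}_n$:

$$\begin{aligned}
\mathcal{K}_{n-1} - \mathcal{K}_n &\ge \frac{2}{N} \left( \sum_{j=n+1}^{N} (e_j - \epsilon) \right) (e_n - \epsilon) \\
&= \frac{1}{N} \left( \sum_{j=n}^{N} (e_j - \epsilon) \right)^{2} - \frac{1}{N} \left( \sum_{j=n+1}^{N} (e_j - \epsilon) \right)^{2} - \frac{1}{N} (e_n - \epsilon)^2 \\
&\ge \frac{1}{N} \left( \left( \sum_{j=n}^{N} (e_j - \epsilon) \right)^{+} \right)^{2} - \frac{1}{N} \left( \left( \sum_{j=n+1}^{N} (e_j - \epsilon) \right)^{+} \right)^{2} - \frac{1}{N}.
\end{aligned} \tag{46}$$

Adding (46) and (44), we obtain (45).

• If $\sum_{j=n+1}^{N} (e_j - \epsilon) < \epsilon$, then Bill does not bet at all, and $\mathcal{K}_{n-1} = \mathcal{K}_n$.
Because

$$\frac{1}{N} \left( \left( \sum_{j=n}^{N} (e_j - \epsilon) \right)^{+} \right)^{2} - \frac{1}{N} \left( \left( \sum_{j=n+1}^{N} (e_j - \epsilon) \right)^{+} \right)^{2} \le \frac{1}{N},$$

we again obtain (45) from (44).

∎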
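Bill's strategy is simple enough to check mechanically. The Python sketch below plays the strategy against every 0/1 outcome sequence of length 8 and asserts the bound (44) after each round. It is an illustration under stated assumptions: the function names are hypothetical, and Bill is charged exactly $\epsilon$ per unit of $e_n$, which is the least favorable price the lemma allows. Exact fractions avoid rounding issues.

```python
from fractions import Fraction
from itertools import product

def positive_part(t):
    """t^+ := max(t, 0)."""
    return max(t, Fraction(0))

def bound_44(n, N, tail):
    """Right-hand side of (44), where tail is the sum of (e_j - eps) over j > n."""
    return Fraction(n, N) + positive_part(tail) ** 2 / N

def play(events, eps):
    """Run Bill's strategy for n = N, ..., 1 and return his capitals as {n: K_n}.

    events[n - 1] is e_n (1 if E_n happens, 0 if it fails).
    Assumption of this sketch: each unit of e_n costs exactly eps.
    """
    N = len(events)
    capital = {N: Fraction(1)}           # initial capital K_N = 1
    tail = Fraction(0)                   # sum over j = n+1, ..., N of (e_j - eps)
    for n in range(N, 0, -1):
        e_n = events[n - 1]
        units = Fraction(2, N) * tail if tail >= eps else Fraction(0)
        capital[n - 1] = capital[n] + units * (e_n - eps)
        tail += e_n - eps                # now the sum over j = n, ..., N
        assert capital[n - 1] >= bound_44(n - 1, N, tail)
    return capital

eps = Fraction(1, 10)
for events in product([0, 1], repeat=8):
    assert play(events, eps)[0] >= 0
print("bound (44) held for all", 2 ** 8, "sequences of length 8")
```

Processing the rounds from $n = N$ down to $n = 1$ mirrors the backward-looking protocol: the running sum `tail` depends only on outcomes Bill has already seen.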

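Proof of Proposition 3 The inequality $\mathrm{Freq}_N \ge \epsilon + \delta_2$ is equivalent to

$$\frac{1}{N} \left( \left( \sum_{j=1}^{N} (e_j - \epsilon) \right)^{+} \right)^{2} \ge N \delta_2^2. \tag{47}$$

Bill's strategy in Lemma 2 does not risk bankruptcy (it is obvious that $\mathcal{K}_n \ge 0$
for all $n$ when $\epsilon < 1/2$), and (44) says

$$\mathcal{K}_0 \ge \frac{1}{N} \left( \left( \sum_{j=1}^{N} (e_j - \epsilon) \right)^{+} \right)^{2}. \tag{48}$$

Combining (47) and (48) with the assumption that $N \ge 1/(\delta_1 \delta_2^2)$, we see that
when the event $\mathrm{Freq}_N \ge \epsilon + \delta_2$ happens, $\mathcal{K}_0 \ge 1/\delta_1$. So by (42) and (43),
$\overline{\mathbb{P}}(\mathrm{Freq}_N \ge \epsilon + \delta_2) \le \delta_1$. ∎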

A.3 The independence of hits for Fisher's interval

Recall that if $z_1, \dots, z_n, z_{n+1}$ are independent normal random variables with
mean 0 and standard deviation 1, the distribution of the ratio

$$\frac{z_{n+1}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} z_i^2}} \tag{49}$$

is called the $t$-distribution with $n$ degrees of freedom. The upper percentile points
for this distribution, the points $t_n^\epsilon$ exceeded by (49) with probability exactly $\epsilon$,
are readily available from textbooks and standard computer programs.

Given a sequence of numbers $z_1, \dots, z_l$, where $l \ge 2$, we set

$$\bar z_l := \frac{1}{l} \sum_{i=1}^{l} z_i \quad \text{and} \quad s_l^2 := \frac{1}{l-1} \sum_{i=1}^{l} (z_i - \bar z_l)^2.$$

As we recalled in §2.1.1, R. A. Fisher proved that if $n \ge 3$ and $z_1, \dots, z_n$ are
independent and normal with a common mean and standard deviation, then the
ratio $t_n$ given by

$$t_n := \frac{z_n - \bar z_{n-1}}{s_{n-1}} \sqrt{\frac{n-1}{n}} \tag{50}$$

has the $t$-distribution with $n - 2$ degrees of freedom [13]. It follows that the
event

$$\bar z_{n-1} - t_{n-2}^{\epsilon/2} \, s_{n-1} \sqrt{\frac{n}{n-1}} \;\le\; z_n \;\le\; \bar z_{n-1} + t_{n-2}^{\epsilon/2} \, s_{n-1} \sqrt{\frac{n}{n-1}} \tag{51}$$

has probability $1 - \epsilon$. We will now prove that the $t_n$ for successive $n$ are
independent. This implies that the events (51) for successive values of $n$ are
independent, so that the law of large numbers applies: with very high probability
approximately $1 - \epsilon$ of these events will happen. This independence was
overlooked by Fisher and subsequent authors.
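The event (51) is the same as the event that $|t_n|$ does not exceed the quantile $t_{n-2}^{\epsilon/2}$, which is why independence of the $t_n$ gives independence of the hits. The Python sketch below illustrates this equivalence on a small data set; the function names are hypothetical, and the quantile is passed in by the caller rather than computed.

```python
import math

def mean_and_sd(old):
    """Sample mean and standard deviation (divisor m - 1) of the old observations."""
    m = len(old)
    mean = sum(old) / m
    sd = math.sqrt(sum((z - mean) ** 2 for z in old) / (m - 1))
    return mean, sd

def fisher_interval(old, t_quantile):
    """Interval (51) for z_n, given old = z_1, ..., z_{n-1} with at least two values."""
    m = len(old)                                   # m = n - 1
    mean, sd = mean_and_sd(old)
    half = t_quantile * sd * math.sqrt((m + 1) / m)
    return mean - half, mean + half

def t_ratio(old, z_new):
    """The ratio t_n of (50)."""
    m = len(old)
    mean, sd = mean_and_sd(old)
    return (z_new - mean) / sd * math.sqrt(m / (m + 1))

old = [1, 2, 3, 4, 5]
t_quantile = 2.776        # tabulated upper 2.5% point for 4 degrees of freedom
low, high = fisher_interval(old, t_quantile)
for z_new in (7, 8):
    hit = low <= z_new <= high
    print(z_new, hit, abs(t_ratio(old, z_new)) <= t_quantile)
```

For both candidate values the two printed booleans agree, as the algebra requires.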

We begin with two purely arithmetic lemmas, which do not rely on any
assumption about the probability distribution of $z_1, \dots, z_n$.

Lemma 3 The ratio $t_n$ given by (50) depends on $z_1, \dots, z_n$ only through the
ratios among themselves of the differences

$$z_1 - \bar z_n, \; \dots, \; z_n - \bar z_n. \tag{52}$$

Proof It is straightforward to verify that

$$z_n - \bar z_{n-1} = \frac{n}{n-1} (z_n - \bar z_n) \tag{53}$$

and

$$s_{n-1}^2 = \frac{(n-1) s_n^2}{n-2} - \frac{n (z_n - \bar z_n)^2}{(n-1)(n-2)}. \tag{54}$$

Substituting (53) and (54) in (50) produces

$$t_n = \frac{\sqrt{n(n-2)} \, (z_n - \bar z_n)}{\sqrt{(n-1)^2 s_n^2 - n (z_n - \bar z_n)^2}} \tag{55}$$

or

$$t_n = \frac{\sqrt{n(n-2)} \, (z_n - \bar z_n)}{\sqrt{(n-1) \sum_{i=1}^{n} (z_i - \bar z_n)^2 - n (z_n - \bar z_n)^2}}. \tag{56}$$

The value of (56) is unaffected if all the $z_i - \bar z_n$ are multiplied by a nonzero
constant. ∎

Lemma 4 Suppose $\bar z_n$ and $s_n$ are known. Then the following three additional
items of information are equivalent, inasmuch as the other two can be calculated
from any of the three:

1. $z_n$

2. $\bar z_{n-1}$ and $s_{n-1}$

3. $t_n$

Proof Given $z_n$, we can calculate $\bar z_{n-1}$ and $s_{n-1}$ from (53) and (54) and then
calculate $t_n$ from (50). Given $\bar z_{n-1}$ and $s_{n-1}$, we can calculate $z_n$ from (53)
or (54) and then $t_n$ from (50). Given $t_n$, we can invert (55) to find $z_n$ (when
$\bar z_n$ and $s_n$ are fixed, this equation expresses $t_n$ as a monotonically increasing
function of $z_n$) and then calculate $\bar z_{n-1}$ and $s_{n-1}$ from (53) and (54). ∎
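The identities (53) and (54) and the agreement of (50) with (56) can be checked numerically. The following Python sketch does so for one arbitrary sequence; the function names and the test data are chosen only for illustration.

```python
import math

def mean(zs):
    return sum(zs) / len(zs)

def sample_var(zs):
    """Sample variance with divisor len(zs) - 1."""
    m = mean(zs)
    return sum((z - m) ** 2 for z in zs) / (len(zs) - 1)

def t_from_50(zs):
    """t_n from (50): z_n compared with the mean and variance of z_1, ..., z_{n-1}."""
    n = len(zs)
    prev = zs[:-1]
    return (zs[-1] - mean(prev)) / math.sqrt(sample_var(prev)) * math.sqrt((n - 1) / n)

def t_from_56(zs):
    """t_n from (56): uses only the differences z_i minus the mean of all n values."""
    n = len(zs)
    m = mean(zs)
    d_n = zs[-1] - m
    total = sum((z - m) ** 2 for z in zs)
    return math.sqrt(n * (n - 2)) * d_n / math.sqrt((n - 1) * total - n * d_n ** 2)

zs = [2.0, 3.5, 1.0, 4.0, 6.5]
n = len(zs)
d_n = zs[-1] - mean(zs)

# Identity (53)
assert math.isclose(zs[-1] - mean(zs[:-1]), n / (n - 1) * d_n)
# Identity (54)
assert math.isclose(sample_var(zs[:-1]),
                    (n - 1) * sample_var(zs) / (n - 2) - n * d_n ** 2 / ((n - 1) * (n - 2)))
# (50) and (56) agree
assert math.isclose(t_from_50(zs), t_from_56(zs))
# Shifting all values and rescaling by a positive constant leaves t_n unchanged
assert math.isclose(t_from_50([3 * z + 7 for z in zs]), t_from_50(zs))
print(round(t_from_50(zs), 4))
```

The last check uses a positive scale factor, which is the case needed later when one sphere is mapped onto another.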

Now we consider probability distributions for $z_1, \dots, z_n$.

Lemma 5 If $z_1, \dots, z_n$ are independent and normal with a common mean and
standard deviation, then conditional on $\bar z_n = w$ and $\sum_{i=1}^{n} (z_i - \bar z_n)^2 = r^2$, the
vector $(z_1, \dots, z_n)$ is distributed uniformly over the surface of the $n$-dimensional
sphere of radius $r$ centered on the point $(w, \dots, w)$ in $\mathbb{R}^n$.

Proof The logarithm of the joint density of $z_1, \dots, z_n$ is

$$-\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (z_i - \mu)^2
= -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \left( \sum_{i=1}^{n} (z_i - \bar z_n)^2 + n (\bar z_n - \mu)^2 \right),$$

where $\mu$ and $\sigma$ are the mean and standard deviation, respectively. Because this
depends on $(z_1, \dots, z_n)$ only through $\bar z_n$ and $\sum_{i=1}^{n} (z_i - \bar z_n)^2$, the distribution
of $(z_1, \dots, z_n)$ conditional on $\bar z_n = w$ and $\sum_{i=1}^{n} (z_i - \bar z_n)^2 = r^2$ is uniform over
the set of vectors satisfying these conditions. ∎

Lemma 6 If the vector $(z_1, \dots, z_n)$ is distributed uniformly over the surface of
the $n$-dimensional sphere of radius $r$ around $(w, \dots, w)$ in $\mathbb{R}^n$, then $t_n$ has the
$t$-distribution with $n - 2$ degrees of freedom.

Proof The distribution of $t_n$ does not depend on $w$ or $r$. This is because we
can transform the uniform distribution over one $n$-dimensional sphere into a
uniform distribution over another by adding a constant to all the $z_i$ and then
multiplying the differences $z_i - \bar z_n$ by a constant, and by Lemma 3, this will not
change $t_n$.

Now suppose $z_1, \dots, z_n$ are independent and normal with a common mean
and standard deviation. Lemma 5 says that conditional on $\bar z_n = w$ and
$(n - 1) s_n^2 = r^2$, the vector $(z_1, \dots, z_n)$ is distributed uniformly over the surface of
the sphere of radius $r$ centered on $(w, \dots, w)$. Since the resulting distribution
for $t_n$ does not depend on $w$ or $r$, it must be the same as the unconditional
distribution. ∎

Lemma 7 Suppose $(z_1, \dots, z_N)$ is distributed uniformly over the surface of the
$N$-dimensional sphere of radius $r$ around $(w, \dots, w)$ in $\mathbb{R}^N$. Then $t_3, \dots, t_N$ are
mutually independent.

Proof It suffices to show that $t_n$ still has the $t$-distribution with $n - 2$ degrees
of freedom conditional on $t_{n+1}, \dots, t_N$. This will imply that $t_n$ is independent
of $t_{n+1}, \dots, t_N$ and hence that all the $t_n$ are mutually independent.

We start knowing $\bar z_N$ and $s_N$. So by Lemma 4, learning $t_{n+1}, \dots, t_N$ is
the same as learning $z_{n+1}, \dots, z_N$. Geometrically, when we learn $z_N$ we
intersect our $N$-dimensional sphere in $\mathbb{R}^N$ with a hyperplane, reducing it to an
$(N-1)$-dimensional sphere in $\mathbb{R}^{N-1}$. (Imagine, for example, intersecting a
3-dimensional sphere with a plane: the result is a disc.) When we learn $z_{N-1}$,
we reduce the dimension again, and so on. In each case, we obtain a uniform
distribution on the surface of the lower-dimensional sphere for the remaining $z_i$.
In the end, we find that $(z_1, \dots, z_n)$ is distributed uniformly over the surface
of an $n$-dimensional sphere in $\mathbb{R}^n$, and so $t_n$ has the $t$-distribution with $n - 2$
degrees of freedom by Lemma 6. ∎

Proposition 4 Suppose $z_1, \dots, z_N$ are independent and normal with a common
mean and standard deviation. Then $t_3, \dots, t_N$ are mutually independent.

Proof By Lemma 7, $t_3, \dots, t_N$ are mutually independent conditional on $\bar z_N = w$
and $s_N = r$, each $t_n$ having the $t$-distribution with $n - 2$ degrees of freedom.
Because this joint distribution for $t_3, \dots, t_N$ does not depend on $w$ or $r$, it is
also their unconditional joint distribution. ∎